Re: [SWISH-E:210] Indexing of MS word documents

From: Paul J. Lucas <pjl(at)not-real.ptolemy.arc.nasa.gov>
Date: Tue Mar 24 1998 - 01:26:10 GMT
On Mon, 23 Mar 1998, Dean Robson wrote:

> I have downloaded swish-e and autoswish with the intention to look at its
> suitability to index our internal documents.  Typically these documents are
> in MS word format and MS Excel.
> 
> Can swish handle this?

	Not really.

> Alternatively, are there other sun based indexers available?

	Yes, SWISH++ (see the link to it via the SWISH-E home page).
	This is one of the primary reasons I wrote SWISH++.  SWISH++
	does not index MS documents directly; rather it includes a
	utility to extract the raw text out of such documents, e.g.:

		my.doc -> my.doc.txt
		your.xls -> your.xls.txt

	You then index only the *.txt documents.  The Perl-CGI/web
	interface recognizes a file having a double file extension
	and substitutes the original filename on the fly.
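
	As an illustration of the double-extension substitution
	described above, here is a minimal sketch in Python.  It is
	not SWISH++'s Perl code: the function name original_name and
	the SOURCE_EXTENSIONS set are assumptions made up for this
	example.

```python
import os

# Hypothetical list of extensions that the extraction utility handles;
# this set is an assumption for the example, not taken from SWISH++.
SOURCE_EXTENSIONS = {".doc", ".xls"}


def original_name(indexed_name: str) -> str:
    """Map an indexed 'my.doc.txt' back to 'my.doc'.

    Names without a recognized double extension are returned unchanged.
    """
    stem, ext = os.path.splitext(indexed_name)
    if ext.lower() != ".txt":
        return indexed_name
    # Look at the extension hiding underneath the trailing ".txt".
    _, inner_ext = os.path.splitext(stem)
    if inner_ext.lower() in SOURCE_EXTENSIONS:
        return stem
    return indexed_name
```

	With this sketch, a search hit on "my.doc.txt" would be
	presented as a link to "my.doc", while an ordinary
	"notes.txt" would be left alone.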

	The text extraction isn't perfect -- it can't be without an
	understanding of English (or other natural human languages) and
	a dictionary of words.  It errs on the conservative side, so
	it also extracts gibberish "words" (sequences of binary data
	inside the MS file that just happen to be ASCII, e.g.,
	"BXZPH"), and these get indexed too.  However, since nobody
	will (presumably) ever search on such a word, all it does is
	bloat the index file.
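
	To make the conservative extraction concrete, here is a
	rough sketch in Python of one way to pull runs of ASCII
	letters out of binary data, similar in spirit to the Unix
	strings(1) utility.  This is not the actual SWISH++
	extraction utility; the MIN_LENGTH threshold and the function
	name extract_words are assumptions chosen for illustration.

```python
import re
from typing import List

# Assumed threshold: a run of at least this many ASCII letters is a "word".
MIN_LENGTH = 4

# Matches runs of MIN_LENGTH or more ASCII letters in raw bytes.
WORD_RUN = re.compile(rb"[A-Za-z]{%d,}" % MIN_LENGTH)


def extract_words(data: bytes) -> List[str]:
    """Return every run of ASCII letters that is long enough to be a word.

    No dictionary is consulted, so binary data that happens to look
    like letters is returned alongside the real text.
    """
    return [run.decode("ascii") for run in WORD_RUN.findall(data)]


sample = b"\x00\x01Quarterly\x02report\xffBXZPH\x00ab\x03"
print(extract_words(sample))  # ['Quarterly', 'report', 'BXZPH']
```

	Note how "BXZPH" survives next to the real words: without a
	dictionary there is no way to tell the two apart, so the
	sketch keeps both rather than risk dropping real text.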

	Sometimes a file is mostly gibberish.  This results in mountains
	of data being thrown at the indexing engine.  SWISH-E was
	crushed under the immense weight and ran out of memory;
	SWISH++ indexes mountains of data just fine.  Being able to
	index such documents, bloat and all, is preferable to not
	being able to index them at all.

	- Paul J. Lucas
	  NASA Ames Research Center		Caelum Research Corporation
	  Moffett Field, California		San Jose, California
	  <pjl AT ptolemy DOT arc DOT nasa DOT gov>
Received on Mon Mar 23 17:34:48 1998