Re: Indexing of word documents, stored on a UNIX

From: FISHER,JOSEPH (Non-HP-Roseville,ex1) <joseph_fisher(at)not-real.non.hp.com>
Date: Fri Aug 17 2001 - 21:57:18 GMT

Hi Bill,

Ok, I understand that I need to include a filter file in order to index the
contents of MS Word documents stored on a Unix system... (As I understand
it, this was NOT necessary under SWISH 1.3...)

I've downloaded and compiled "catdoc"... Catdoc is even referenced in one of
the filter files under SWISH-E 2.1...

	.../filter-bin/_doc2text.sh

I've installed it in its default location, and made sure that the filter
file is pointing to the correct directory structure...

But which configuration file should I modify to make SWISH-E see this MS
Word filter file?

Thanks in advance,

Joe Fisher

-----Original Message-----
From: Bill Moseley [mailto:moseley@hank.org]
Sent: Friday, August 17, 2001 12:04
To: Multiple recipients of list
Subject: [SWISH-E] Re: Indexing of word documents, stored on a UNIX


At 11:31 AM 08/17/01 -0700, FISHER,JOSEPH (Non-HP-Roseville,ex1) wrote:
>When I index the documents, everything appears to go through just fine,
>with the following exceptions:
>
>	1) I get a warning message for each file being indexed:
>
>		Warning: Possible embedded null in file
>'/case_cr_rpts/docs/dataload/xml_spec3.doc'

Well, without seeing your config, I don't know.  To index Word documents you
need to use a filter (or add filtering to your program if indexing with -S
prog).

http://sunsite.berkeley.edu/SWISH-E/2.2/docs/SWISH-CONFIG.html#Document_Filter_Directives

Don't use a shell or perl script to call catdoc -- rather call catdoc
directly as shown in the example.   The scripts will kill your indexing
speed.
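
As a rough sketch of what calling catdoc directly might look like, the fragment below would go in the configuration file passed to swish-e with -c. It follows the FileFilter syntax as I understand it from the 2.2 Document Filter Directives section linked above; earlier releases may use a different form. The catdoc path and the charset options are assumptions, so adjust them for your own install.

```
# Hypothetical fragment of the config file given to swish-e with -c.
# /usr/local/bin/catdoc is an assumed install path, not taken from this thread.
# '%p' is meant to stand for the path of the document being filtered;
# -s and -d are catdoc's source and destination charset options.
FileFilter .doc /usr/local/bin/catdoc "-s8859-1 -d8859-1 '%p'"
```

Naming the catdoc binary in the directive, rather than a wrapper script, avoids starting an extra shell or perl process for every document.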




Bill Moseley
mailto:moseley@hank.org
Received on Fri Aug 17 22:22:55 2001