Re: last modified date in swish-e index file

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Jun 01 2001 - 04:18:45 GMT

At 06:21 PM 05/31/01 -0700, Steve McMillen wrote:
>In 1.3 and 2.0.5 I was able to index a .doc or .xls file even though it
>was binary.  Now when I index these types of files I get the following
>error:
>
>Warning: Possible embedded null in file '/www/html/.../SurveyQues

That warning was added to catch nulls in files where they are not expected.
Swish stops reading at the first null, so the rest of the file does not get
indexed.  The warning basically means that the file is longer than the
content that will be indexed, so it's a good warning.  It was added after a
report of swish not indexing a file properly, and the file turned out to
have an embedded null.
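
To make the "file is longer than the content that will be indexed" point
concrete, here is a minimal Python sketch of that comparison.  It is only an
illustration, not swish's actual code: the function names and the sample
bytes are made up, and it assumes the reader treats the first null byte as
the end of the content.

```python
def indexed_prefix_length(data: bytes) -> int:
    """Number of bytes a null-terminated reader would see (hypothetical helper)."""
    pos = data.find(b"\x00")
    return len(data) if pos == -1 else pos


def has_embedded_null(data: bytes) -> bool:
    """True when the data continues past its first null byte."""
    return indexed_prefix_length(data) < len(data)


# Made-up sample: readable text, a null, then more bytes that would be skipped.
sample = b"survey questions\x00binary tail"
if has_embedded_null(sample):
    print("Warning: possible embedded null; only",
          indexed_prefix_length(sample), "of", len(sample), "bytes are visible")
```

Running the same check over a directory before indexing would show which
files are going to be cut short.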

Swish doesn't index binary files.  I think what you want to do is use a
filter and the strings(1) program to extract the strings from the binary
files.

If you are indexing .doc (Word) files, then there's the catdoc program to
extract the text.  I believe I saw a utility that extracts text from .xls
files, but I don't remember anything specific.

>I notice that in 2.1 there is a Document Filter Directive to preprocess
>a file in an external program.  Is this now required (and what I am
>seeing above is expected behavior) or should I still be able to index
>binary files (and thus, should I report my issue above as a bug?).

Report it as a bug fixed ;)

>I should add that it is very convenient that swish 2.0.5 just indexed
>the files 'as-is'.

But the problem is that it wasn't indexing the files.  If there was a null,
the indexing stopped for that file, so you really were not indexing the
files as-is, only up to the first null.

>If I do have to use the Document Filter, can I just
>run the files through the UNIX strings command?  All I really care about is
>the strings inside the file anyway.

Yep.  Use strings.  Do you have a lot of files like this?  Take a look at
the filter docs included with 2.1.  But also try to find programs that will
actually extract from the file types you are using.
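
If strings(1) isn't handy, something like the following Python sketch does
roughly the same job.  It is an approximation rather than a replacement for
strings: it only keeps runs of printable ASCII, and the four-character
minimum mirrors the usual strings default.  The function name is made up,
and how the script gets hooked up as a filter is an assumption on my part
(path as the first argument, text written to stdout); check the filter docs
for what swish really passes.

```python
import re
import sys

MIN_LEN = 4  # assumed minimum run length, matching the usual strings(1) default


def extract_strings(data: bytes, min_len: int = MIN_LEN) -> list:
    """Return runs of printable ASCII at least min_len characters long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed calling convention: the file to filter is the first argument.
    with open(sys.argv[1], "rb") as fh:
        for text in extract_strings(fh.read()):
            print(text)
```

A dedicated extractor such as catdoc will still give cleaner text than this
for the formats it understands.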


Bill Moseley
mailto:moseley@hank.org
Received on Fri Jun 1 04:20:01 2001