Re: Question on indexing time

From: Bill Moseley <moseley(at)>
Date: Thu Aug 02 2001 - 22:57:09 GMT
At 03:40 PM 08/02/01 -0700, Rick McGowan wrote:
>Bill, thanks again.  I updated to today's snapshot of SWISH-E & re-ran as  
>you suggested. Now instead of running for 48 hours, it seg faults within 2  
>minutes. ;-)
>Along the way, it gives me warnings about possible embedded nulls in some  
>.txt files... Is that something to worry about?

That's my fault.  I added that warning.  It came out of a report from
someone (Chris?) who was indexing HTML files but couldn't find a document
in a search.  It turned out that the document had an embedded null, which
meant everything after the null was not indexed.  So I added that warning,
which I think fires when the length of the text in the buffer holding the
file does not match the size of the file.
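
For anyone who wants to check a file by hand, here is a minimal sketch in C
of that kind of test.  It assumes the file has already been read into a
buffer of known size.  The function names are made up for illustration;
this is not the actual SWISH-E code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of SWISH-E).
 * Returns the offset of the first NUL byte inside buf[0..size),
 * or -1 if the buffer contains no NUL byte at all. */
long first_embedded_null(const char *buf, size_t size)
{
    const char *p;

    assert(buf != NULL);
    p = (const char *)memchr(buf, '\0', size);
    return p ? (long)(p - buf) : -1L;
}

/* Hypothetical helper (not part of SWISH-E).
 * Returns how many bytes, counting from the first NUL onward, a
 * strlen()-style routine would never see; 0 if there is no NUL. */
size_t bytes_hidden_by_null(const char *buf, size_t size)
{
    long off = first_embedded_null(buf, size);
    return off < 0 ? 0 : size - (size_t)off;
}
```

Pass in the file contents and the size reported by stat(); a non-negative
offset from first_embedded_null() means string-based parsing would stop
at that point.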

So, yes, it might be something to worry about.  I'd be interested to see if
the file really does have an embedded null.  You can index just that file
with -i foo.txt and then try to search for words at the end of the file.

>I finally got it to work, however, after doing two things: I had originally  
>set it for "IgnoreLimit 20 1000", but it gave me a warning, so I removed it  
>and enabled "IgnoreWords SwishDefault"; now it tells me that's obsolete.

Cool!  I'm not sure I'd call IgnoreLimit depreciated, but I'm not so sure
how much it's appreciated....  The problem started when swish could index
phrases.  But with IgnoreLimit you are telling swish to go back and remove
words after indexing.  That's a lot of work since it must adjust word
positions where words are removed.  Just use IgnoreWords and you will be

>I also turned off indexing of ".txt" files, since that might have been the  
>cause of the seg fault...

We should probably track down the segfault.  Do you know how to get a
backtrace with gdb?

>It completed the task and produced an index (of HTML files only), which I  
>could use with  Works great.

In less than two days?

Bill Moseley
