Re: Indexing cut off - more info

From: <moseley(at)not-real.hank.org>
Date: Tue Apr 29 2003 - 14:03:09 GMT
On Tue, Apr 29, 2003 at 06:47:45AM -0700, David VanHook wrote:
> 
> Here's a bit more information -- it appears that the logfiles for the "good"
> indexings and the logfiles for the "bad" indexings are different in one key
> respect.
> 
> The number of files they index is the same: 21,000 files.  But on the bad
> ones, the indexer is finding 26041 unique words, and a total of 535,411
> total words.  On the good ones, the indexer is finding 108,563 unique words,
> and 5,971,632 total words.
> 
> So it's seeing the files, but not indexing them completely.  I've looked at
> the source code, and the SwishCommand noindex and SwishCommand index tags
> are in the proper spots.  And we've not made any edits to our stopwords file
> since January.
> 
> Any ideas which would cause the spider.pl to look at the files but not index
> them in this fashion?

Which version are you running?

Those are big differences in word counts, so you should be able to find a
single document to test with.  If not, there's probably a way to find the
bad files by running with -T and counting the number of words per file.
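
As a rough sketch of that per-file comparison (not anything swish ships
with): assume you have already reduced a "good" run and a "bad" run to
lines of the form "path<TAB>word count".  That format is an assumption made
for this example, not the actual -T output, and the function names and the
0.5 threshold are hypothetical.

```python
def parse_counts(lines):
    """Parse 'path<TAB>count' lines into a dict (format assumed for this sketch)."""
    counts = {}
    for line in lines:
        line = line.strip()
        if "\t" not in line:
            continue  # skip blank or malformed lines
        path, _, count = line.rpartition("\t")
        counts[path] = int(count)
    return counts


def find_shrunken_files(good_counts, bad_counts, threshold=0.5):
    """Return (path, good, bad) for files whose count fell below threshold * good."""
    suspects = []
    for path, good in good_counts.items():
        bad = bad_counts.get(path, 0)  # a file missing from the bad run counts as 0
        if good > 0 and bad < good * threshold:
            suspects.append((path, good, bad))
    # Largest absolute loss first, so the best test documents come out on top.
    suspects.sort(key=lambda item: item[1] - item[2], reverse=True)
    return suspects


if __name__ == "__main__":
    good = parse_counts(["a.html\t100", "b.html\t200"])
    bad = parse_counts(["a.html\t95", "b.html\t20"])
    for path, g, b in find_shrunken_files(good, bad):
        print(f"{path}: {g} -> {b}")
```

Any file near the top of that list is a candidate for the single-document
test.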

Then I'd just look at the output from spider.pl and see what's missing.  If
nothing is missing, feed that output into swish with -T indexed_words and
make sure it's all getting indexed.
Received on Tue Apr 29 14:03:44 2003