Re: Indexing cut off - more info

From: David VanHook <dvanhook(at)not-real.mshanken.com>
Date: Tue Apr 29 2003 - 14:38:11 GMT

Thanks very much for the quick reply.

The version we're running is SWISH-E 2.2rc1.

In looking at which files are getting indexed, and which ones aren't, it
appears that the titles for many documents are getting indexed, but not the
bodies.

Which makes me wonder if the way I've got SwishCommand index and noindex set
up is causing the problem.  They're not balanced.  Here's an example:

<HTML>
<TITLE>title here</TITLE>
<!-- SwishCommand noindex -->
	Junk HTML up here -- navbar, etc.
	Blah blah blah
<!-- SwishCommand index -->
	Document body goes in here
	Good document body
<!-- SwishCommand noindex -->
more junk code
more junk code
<!-- SwishCommand noindex -->
</BODY>
</HTML>

When I do a search for word X (on these bad indexes), it appears that word X
is only showing up when it appears in the title of the document.

I've had it set up this way from the very beginning, as I recall, but maybe
I'm remembering wrong.  Is it possible that SWISH is "remembering" the
unmatched NOINDEX command from previous documents and is getting confused
somehow?
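
To make the balance question concrete, here is a minimal Python sketch that
audits one page for unmatched directives.  It assumes a simple on/off toggle
model, which may not match how SWISH-E actually treats repeated or unclosed
commands.  The function name audit_swish_commands and the SAMPLE page are
made up for illustration.

```python
import re

# Matches <!-- SwishCommand index --> and <!-- SwishCommand noindex -->
DIRECTIVE = re.compile(r"<!--\s*SwishCommand\s+(index|noindex)\s*-->", re.IGNORECASE)


def audit_swish_commands(html):
    """Return (kept_text, warnings) under an assumed on/off toggle model."""
    indexing = True          # assumption: every document starts with indexing on
    kept, warnings = [], []
    pos = 0
    for match in DIRECTIVE.finditer(html):
        if indexing:
            kept.append(html[pos:match.start()])
        wanted = match.group(1).lower() == "index"
        if wanted == indexing:
            # The directive asks for the state we are already in.
            line = html.count("\n", 0, match.start()) + 1
            warnings.append(f"line {line}: redundant '{match.group(1)}'")
        indexing = wanted
        pos = match.end()
    if indexing:
        kept.append(html[pos:])
    else:
        warnings.append("document ends with indexing still switched off")
    return "".join(kept), warnings


# Hypothetical page laid out like the example above.
SAMPLE = """<HTML>
<TITLE>title here</TITLE>
<!-- SwishCommand noindex -->
	Junk HTML up here -- navbar, etc.
<!-- SwishCommand index -->
	Good document body
<!-- SwishCommand noindex -->
more junk code
<!-- SwishCommand noindex -->
</BODY>
</HTML>
"""

kept, warnings = audit_swish_commands(SAMPLE)
for warning in warnings:
    print(warning)
```

If the real parser carries the "off" state from one document into the next,
a page that ends with indexing switched off is the case to look for, and this
check flags it.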

Thanks.

Dave V.



-----Original Message-----
From: swish-e@sunsite.berkeley.edu
[mailto:swish-e@sunsite.berkeley.edu]On Behalf Of moseley@hank.org
Sent: Tuesday, April 29, 2003 10:03 AM
To: Multiple recipients of list
Subject: [SWISH-E] Re: Indexing cut off - more info


On Tue, Apr 29, 2003 at 06:47:45AM -0700, David VanHook wrote:
>
> Here's a bit more information -- it appears that the logfiles for the "good"
> indexings and the logfiles for the "bad" indexings are different in one key
> respect.
>
> The number of files they index is the same: 21,000 files.  But on the bad
> ones, the indexer is finding 26,041 unique words and 535,411 total words.
> On the good ones, the indexer is finding 108,563 unique words and
> 5,971,632 total words.
>
> So it's seeing the files, but not indexing them completely.  I've looked at
> the source code, and the SwishCommand noindex and SwishCommand index tags
> are in the proper spots.  And we've not made any edits to our stopwords file
> since January.
>
> Any ideas which would cause the spider.pl to look at the files but not index
> them in this fashion?

Which version are you running?

Those are big differences in word counts, so you should be able to find a
single document to test with.  If not, there's probably a way to find the
bad files by using -T and counting the number of words per file.
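
As a rough sanity check on those log figures, the arithmetic can be sketched
in Python.  The only inputs are the numbers quoted above; treating "21,000
files" as an exact count is an assumption, so the averages are approximate.

```python
# Figures quoted from the indexing logs earlier in this thread.
FILES = 21_000  # "21,000 files", treated as an exact count (an assumption)
RUNS = {
    "bad": {"unique": 26_041, "total": 535_411},
    "good": {"unique": 108_563, "total": 5_971_632},
}


def words_per_file(total_words, files=FILES):
    """Average number of indexed words per file."""
    return total_words / files


per_file = {name: words_per_file(run["total"]) for name, run in RUNS.items()}
ratio = RUNS["bad"]["total"] / RUNS["good"]["total"]

for name, avg in per_file.items():
    print(f"{name}: about {avg:.0f} words per file")
print(f"bad run indexed about {ratio:.0%} of the good run's words")
```

An average of a couple of dozen words per file on the bad runs, against a few
hundred on the good ones, would fit documents whose titles are indexed but
whose bodies are not, though the averages alone do not prove that.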

Then I'd just look at the output from spider.pl and see what's missing.  If
nothing is missing then feed that output into swish and use -T indexed_words
and make sure it's all getting indexed.
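
One way to look at the spider output per document is sketched below in
Python.  It assumes the output was saved to a file and follows the
header-block layout that swish-e's prog input method reads (a Path-Name and
a Content-Length header, a blank line, then exactly that many bytes of
content).  The word rule here is a crude stand-in, not swish-e's own
word-splitting settings, and the names read_documents, count_words and
build are hypothetical.

```python
import io
import re

TAG = re.compile(rb"<[^>]*>")   # crude tag stripper, for illustration only
WORD = re.compile(rb"\w+")      # simplified idea of what a "word" is


def read_documents(stream):
    """Yield (path, body) pairs from a binary stream in header-block layout."""
    while True:
        line = stream.readline()
        while line and not line.strip():   # skip stray blank lines
            line = stream.readline()
        if not line:
            return                          # end of input
        headers = {}
        while line.strip():                 # headers run until a blank line
            name, _, value = line.decode("latin-1").partition(":")
            headers[name.strip().lower()] = value.strip()
            line = stream.readline()
        length = int(headers["content-length"])
        yield headers.get("path-name", "?"), stream.read(length)


def count_words(body):
    """Count words in a document body after dropping anything tag-like."""
    return len(WORD.findall(TAG.sub(b" ", body)))


def build(path, body):
    """Build one demo record; stands in for real spider output."""
    return b"Path-Name: %s\nContent-Length: %d\n\n%s" % (path, len(body), body)


DEMO = (
    build(b"/a.html", b"<TITLE>one two</TITLE>")
    + build(b"/b.html", b"<TITLE>t</TITLE><P>alpha beta gamma</P>")
)

# For real data, open the saved spider output in binary mode instead of DEMO.
counts = {path: count_words(body) for path, body in read_documents(io.BytesIO(DEMO))}
for path, total in sorted(counts.items(), key=lambda item: item[1]):
    print(total, path)
```

Sorting by count puts the thinnest documents first, which should make it easy
to pick a single test file and compare it with what -T indexed_words reports.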
Received on Tue Apr 29 14:42:08 2003