Re: NoContents

From: Bill Moseley <moseley(at)>
Date: Thu Feb 24 2005 - 17:34:13 GMT
On Thu, Feb 24, 2005 at 09:09:40AM -0800, Thomas Angst wrote:
> Hi List,
> I've a strange problem. I believe swish-e is opening and factoring all 
> Files no matter whether 'NoContents' is set or not.

Well, yes, maybe.  The HTML parser reads in the entire file.  The HTML*
parser reads the file and aborts after *swish* gets the <title> tag.
libxml2 may actually end up reading much more of the file.

The "total bytes" is just from the stat() of the file (or the
"content-length" with -S prog).  It may not be what is actually read.

NoContents happens after fetching the file, IIRC.

> You can see the 2 outputs with and without 'NoContents'. With it there
> are only 21 words indexed, without 2236. That's correct, but watch the
> time: there is no difference. It looks like swish-e is processing all
> files anyway but doesn't save the found words.

You are assuming that all the work is done just in indexing -- most is
in I/O and parsing, and using NoContents may not affect that much.

> Can anybody tell me how I can speedup this without skipping the indexed 
> filenames for images?

You might try using -S prog and adjusting what gets sent to swish for
files whose content you don't want indexed.  For example, if you only
want to index image file names then don't fetch the image, but print out
a document to swish that just contains the file name.
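
A rough sketch of such a -S prog feeder, in Python, might look like the
following.  It is only an illustration under assumptions: the extension
list, the function names, and passing the directory as the first
argument are all made up for the example.  It emits just the Path-Name
and Content-Length headers followed by a blank line and the body.  For
image files it never opens the file and sends the file name as the body.

```python
#!/usr/bin/env python3
"""Sketch of a swish-e "-S prog" feeder: images become name-only documents."""
import os
import sys

# Hypothetical list of extensions whose contents should never be read.
IMAGE_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png"}


def format_document(path, body):
    """Return one document: headers, a blank line, then the body bytes."""
    # Content-Length must be the byte count of the body that follows.
    headers = "Path-Name: %s\nContent-Length: %d\n\n" % (path, len(body))
    return headers.encode("utf-8") + body


def document_for(path):
    """Build the document for one file, skipping the read for images."""
    extension = os.path.splitext(path)[1].lower()
    if extension in IMAGE_EXTENSIONS:
        # Do not open the image; the body is just its file name.
        body = (os.path.basename(path) + "\n").encode("utf-8")
    else:
        with open(path, "rb") as handle:
            body = handle.read()
    return format_document(path, body)


def main(root):
    """Walk root and write every document to stdout for swish to read."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            full_path = os.path.join(dirpath, name)
            sys.stdout.buffer.write(document_for(full_path))


# Only runs when a directory is given as the first argument (an assumption
# of this sketch about how the program is invoked).
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Since the image is never opened, the I/O and parsing cost for those
files goes away, while the file name still ends up in the index.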

Bill Moseley

Received on Thu Feb 24 09:34:14 2005