Re: RE: LWP,HTTP and HTML modules

From: Yann Stettler <stettler(at)>
Date: Sat Jan 16 1999 - 23:29:44 GMT
David Norris wrote:

> Someone else asked this about a month ago.  I did some checking then.  Did
> anyone ever rewrite the PERL script for the spider?  The HTML module has
> been discontinued, and, is not available in current releases of PERL.  In my

As I already said, there is another reason to rewrite the
spider program: no check at all is done on the content of the
files requested by HTTP. That means the spider will
_fully_ download pictures, movies, sound files and other binaries
that SWISH will discard right afterwards... Talk about a waste
of bandwidth... not counting the memory used in the case of large
files... (I noticed that it even seems to hang on very large
binary files, even when going through the loopback interface
of the server... Probably all the pattern matching to try to find
URLs among several MB of binary... especially if there aren't
many newlines... regexps are fine but pretty slow under those
conditions.)

SWISH tests the MIME type of the files and only indexes
those of type "text/*". An obvious thing to do would be to
first issue a HEAD request for the document and test the MIME
type in the spider, to avoid transferring files of the wrong type.
Naturally, that would mean doing two requests (first the HEAD
and then the GET) for all documents of the correct type.
Another way, though for that you have to use your own functions
instead of LWP, would be to do a GET request directly
but abort the transfer after reading the response headers if the
document is of the wrong type...
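
A rough sketch of the two approaches just described, written in
Python rather than the spider's Perl, purely as an illustration.
The function names (is_text, fetch_with_head, fetch_abort_early)
and the timeout value are made up for this example and are not
part of SWISH or the spider. It assumes the server sends a
Content-Type header and answers HEAD requests correctly.

```python
import urllib.request


def is_text(content_type):
    """True for any text/* MIME type, ignoring parameters such as charset."""
    return content_type.split(";")[0].strip().lower().startswith("text/")


def fetch_with_head(url, timeout=10):
    """Approach 1: HEAD first, then GET only if the type is text/*.

    Costs two requests for every document that is actually kept.
    """
    head = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(head, timeout=timeout) as resp:
        if not is_text(resp.headers.get("Content-Type", "")):
            return None  # wrong type: never issue the GET
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_abort_early(url, timeout=10):
    """Approach 2: one GET, abandoned before the body is read.

    urlopen() returns once the response headers have arrived; the body
    is only pulled in by read(). Leaving the with-block closes the
    connection, so the rest of a binary file is not transferred (apart
    from whatever the server had already pushed into the socket).
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        if not is_text(resp.headers.get("Content-Type", "")):
            return None  # wrong type: drop the connection, skip the body
        return resp.read()
```

Both functions return the document body as bytes, or None when
the document is not of type text/*, so the spider would only run
its URL pattern matching on content that SWISH can index anyway.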

And if you really want to optimize things, you should
reimplement the "NoContents" directive in SWISH for the HTTP method
(why was it done only for the file system method anyway ???).
That would avoid forking a process and running a PERL program
just to realize that the document shouldn't be indexed after
all.

I think that I posted the "http.c" file that contains the
change to use "NoContents" on the mailing list, or did I
forget ?

Yann Stettler

TheNet - Internet Services AG              CohProg SaRL                           
Anime and Manga Services         
Received on Sat Jan 16 15:21:25 1999