Re: RE: LWP,HTTP and HTML modules

From: Ron Klatchko <ron(at)>
Date: Tue Jan 19 1999 - 19:32:25 GMT
At 03:22 PM 1/16/99 -0800, Yann Stettler wrote:
>(I noticed that it seems to even hang on very large
>binary files even when going through the loopback interface
>of the server... Probably all the pattern matching to try finding
>URLs among several MB of binary... especially if there aren't
>many newlines... regexp is fine but pretty slow under those

I'm not sure why it is hanging, but it's not due to looking for URLs in
binary files.  If you check swishspider, you'll see it only checks for
URLs in files with a MIME type of text/html.

>SWISH tests the mime-type of the files and only indexes
>those of "text/*" type. An obvious thing to do would be to
>first use a HEAD request on the document and test the mime-type
>in the spider to avoid transferring those of the wrong type.
>Naturally, that would mean doing two requests (first the HEAD
>and then the GET) for all documents of the correct types.
>Another way, but for that you have to use your own functions
>instead of using LWP, would be to directly do a GET request
>but abort the transfer after reading the file header if it's
>of the wrong type...

I would recommend either using the latter option or making it configurable.
In some tests with a link checker I wrote, I discovered that there are
still quite a few servers that choke on a HEAD request.
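
A minimal sketch of the "GET, then abort if the type is wrong" option,
written in Python rather than the Perl that swishspider uses.  The function
names and the text/* rule are assumptions made for illustration; this is
not how swishspider is implemented.

```python
import urllib.request

# Assumption: only text/* documents are worth transferring, as in the
# "text/*" rule quoted above.
INDEXABLE_PREFIX = "text/"


def is_indexable(content_type):
    """Return True if a Content-Type header value names a text/* type."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=iso-8859-1" and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime.startswith(INDEXABLE_PREFIX)


def fetch_if_indexable(url, timeout=30):
    """GET url, but read the body only when the headers announce text/*.

    Returns the body as bytes, or None if the document was skipped.
    """
    # urlopen returns once the status line and headers have been read;
    # the body is not consumed until read() is called.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        if not is_indexable(response.headers.get("Content-Type")):
            # Leaving the with-block closes the connection.  The server may
            # already have sent part of the body, but the rest is not read.
            return None
        return response.read()
```

This costs one request per document instead of two, and it does not depend
on the server handling HEAD correctly.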

>And if you really want to optimize things, you should reimplement
>the "NoContents" directive in SWISH for the HTTP method.
>(why was it done only for the file system method anyway ???).
>That would avoid forking a process and running a Perl program
>just to realize that the document shouldn't be indexed after

I completely disagree with that.  The only way to prevent a request is to
use the file extension, and I believe that the HTTP method should rely
solely on the MIME type, as this is the definitive statement of the type
of the file.  Even if it works differently from Swish, I think it makes
more sense to use positive logic (only index files with certain MIME types)
as opposed to the file system method's negative logic (index everything
except files with certain extensions).
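
To make the distinction concrete, here is a small Python sketch of the two
rules.  Both lists and both function names are hypothetical examples, not
Swish defaults or Swish code.

```python
import os.path

# Hypothetical lists, chosen only to illustrate the two approaches.
INDEXED_MIME_TYPES = {"text/html", "text/plain"}
EXCLUDED_EXTENSIONS = {".gif", ".jpg", ".zip"}


def index_by_mime_type(content_type):
    """Positive logic: index only if the server-declared type is listed."""
    mime = (content_type or "").split(";", 1)[0].strip().lower()
    return mime in INDEXED_MIME_TYPES


def index_by_extension(path):
    """Negative logic: index everything unless the extension is listed."""
    extension = os.path.splitext(path)[1].lower()
    return extension not in EXCLUDED_EXTENSIONS
```

A URL such as /cgi-bin/report has no extension, so index_by_extension
accepts it whatever it returns, while index_by_mime_type rejects it if the
server declares it as, say, application/pdf.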

As for avoiding the fork/exec overhead, I'd love to see the Perl helper
script swishspider rewritten in C and pulled into the actual swish
executable.  The only reason I wrote it in Perl in the first place was to
reduce the amount of time I spent implementing the first version.  Does
anyone know of an HTTP library that we could integrate with Swish?  It
should meet the following requirements:

1) Have a license agreement that is in line with Swish's license agreement.
2) Work with most Unices and Win32.
3) Be mature enough that the interface won't be constantly changing.

I'll make this offer.  If someone is willing to do the research and find an
HTTP library that meets those requirements, I'll do the actual integration
work to drop the use of swishspider.  Any takers?


          Ron Klatchko - Manager, Advanced Technology Group           
           UCSF Library and Center for Knowledge Management           
Received on Tue Jan 19 11:27:56 1999