RE: http method v. file system method

From: Steve van der Burg <steve.vanderburg(at)not-real.lhsc.on.ca>
Date: Fri Feb 18 2000 - 14:15:06 GMT
>>From:	Michel Verdier [SMTP:mverdier@chez.com] 
>>Sent:	Thursday, February 17, 2000 8:18 PM
>>To:	Multiple recipients of list
>>Subject:	[SWISH-E] RE: http method v. file system method
>>
>>Chris Humphries <ChrisJMH@vermilion99.freeserve.co.uk> a ecrit :
>>
>>| I don't find the indexing too slow - I know how slow the Internet can be.
>>
>>But you forgot the http server calls. Leaving network response time aside,
>>file access costs only a file access, while http access costs a server call +
>>file access + server response. So the http method is always slower, even on
>>the same machine.
>>
>>--
>>mverdier@chez.com (Michel Verdier)
>
>Good point!
>
>I have to confess that I have not used Swish-E for more than a few hundred 
>documents so far. I imagine that with a few thousand documents, the 
>indexing time differential between file and http methods is really 
>noticeable.
>
>Chris Humphries
>

I spider my site with HTTP because the layout of my document tree in the filesystem is a bit convoluted. The performance difference is partly due to the startup cost of the spider: swish launches a new spider process to fetch and parse each document. I have patched both swish-e and the spider here to keep the spider persistent. Doing it this way cut the time to index my site (done nightly) in half, and cut the system load considerably during that time. I'd be happy to provide the patches if anyone wants them (I've been meaning to anyway, but haven't gotten around to packaging them up yet).

What motivated me to change the "swish launches spider; spider fetches page; spider dies; control returns to swish" model was that the spider at my site kept getting fatter. That is, I kept adding things to it (my spider uses lots of Perl modules, writes stuff to databases, etc.), which made the startup penalty for each invocation higher and higher and made the indexing process take longer and longer.
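
To illustrate the two models, here is a rough sketch in Python. It is not the actual patch (the real spider is a Perl program, and the real protocol between swish-e and the spider is not shown here). The WORKER script, the function names, and the example.invalid URLs are all made up for the illustration; the worker only pretends to fetch a page. The first function starts a new child process for every document, and the second starts one child and feeds it every request over a pipe.

```python
import subprocess
import sys
import time

# Hypothetical stand-in for a spider: a child process that "fetches" each
# URL it reads from stdin and prints one result line per URL. It does no
# real network access; it only exists to make the process model visible.
WORKER = r"""
import sys
for line in sys.stdin:
    url = line.strip()
    if url:
        print("fetched " + url, flush=True)
"""


def index_one_process_per_doc(urls):
    """Original model: launch a new spider process for every document."""
    results = []
    for url in urls:
        done = subprocess.run(
            [sys.executable, "-c", WORKER],
            input=url + "\n",
            capture_output=True,
            text=True,
            check=True,
        )
        results.append(done.stdout.strip())
    return results


def index_persistent(urls):
    """Persistent model: one spider process stays alive for all documents."""
    proc = subprocess.Popen(
        [sys.executable, "-c", WORKER],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    results = []
    for url in urls:
        # Send one request, then wait for the single reply line.
        proc.stdin.write(url + "\n")
        proc.stdin.flush()
        results.append(proc.stdout.readline().strip())
    proc.stdin.close()  # end of input tells the worker to exit
    proc.wait()
    return results


if __name__ == "__main__":
    urls = [f"http://example.invalid/doc{i}.html" for i in range(10)]
    for fn in (index_one_process_per_doc, index_persistent):
        start = time.perf_counter()
        fn(urls)
        print(f"{fn.__name__}: {time.perf_counter() - start:.2f}s")
```

The timings will depend on the machine, but the per-document version pays the interpreter startup cost once per URL, which is the same kind of penalty that grows as a spider loads more modules at startup.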

To see the results of the extra tasks that my version of the spider does, go here:
   http://www.lhsc.on.ca/swish-e/
Since I wrote the document at that URL, I've added a few more features and changed the spidering model (as mentioned above). I've also added a "list links to this page" feature to the search output. To see it in action, go to http://www.lhsc.on.ca/cgibin/search and type in a search term ("hospital" will get a result list).

..Steve


-- 
Steve van der Burg
Information Services
London Health Sciences Centre
(519) 685-8300 ext 35559
steve.vanderburg@lhsc.on.ca
Received on Fri Feb 18 09:19:02 2000