
Re: HTTP method and swishspider

From: P. Bryan Heidorn <heidorn(at)not-real.alexia.lis.uiuc.edu>
Date: Tue Sep 26 2000 - 02:20:32 GMT
I have been using both filesystem and HTTP indexing on prior versions of
swish, but only for relatively small, slowly changing collections of about
1500 documents at a time. Filesystem indexing runs too quickly on these
collections to worry about. For remote systems the HTTP indexing works fine
if I have the system wake up at night and do the indexing. I have had
problems when I have asked students to use HTTP indexing on remote sites.
It is too easy to put the assignment off until the last night and then,
since I told them about the delay timer, turn down the delay and try to
download an entire site in a few minutes. That wouldn't be too bad if it
were not for the fact that groups of students tend to fill the labs at the
same time, indexing the same site. Bad news for the server and the LAN! So
inter-document delays are our friends. It would be good in these situations
if the spider did not need to be reloaded into memory with each document
cycle! A C-based integrated spider would help.
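
To make that concrete, here is a minimal sketch of the in-process fetch
loop I have in mind. I am assuming libcurl for the transfer (that is my
choice for illustration, not what swishspider actually uses), and the
60-second delay is an arbitrary example value:

    /* fetchloop.c -- fetch each URL on the command line, pausing
       between requests so we do not hammer the server. libcurl is
       my assumption here, not what swishspider uses. Bodies go to
       stdout via libcurl's default write behavior. */
    #include <stdio.h>
    #include <unistd.h>
    #include <curl/curl.h>

    int main(int argc, char **argv)
    {
        CURL *curl;
        int i;

        curl_global_init(CURL_GLOBAL_DEFAULT);
        curl = curl_easy_init();
        if (curl == NULL)
            return 1;

        for (i = 1; i < argc; i++) {
            curl_easy_setopt(curl, CURLOPT_URL, argv[i]);
            if (curl_easy_perform(curl) != CURLE_OK)
                fprintf(stderr, "fetch failed: %s\n", argv[i]);
            if (i + 1 < argc)
                sleep(60); /* inter-document delay, in seconds */
        }

        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return 0;
    }

Because the loop stays in one process, the per-document start-up cost of
relaunching the spider disappears, and the polite delay becomes the only
per-document overhead.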

I do have a problem with the current spider's inability to span machines,
since my collections do span machines. I have not figured out how to patch
around it in the current implementation yet; I have not really given it
much thought.
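
If I were to patch it, the obvious shape is a configured allowlist of
hosts instead of the single start host. A rough sketch of the check is
below; the names and hosts are mine for illustration, not SWISH-E
internals:

    /* hostcheck.c -- sketch of a multi-host policy for a spider.
       allowed_hosts would be read from the config file in practice;
       it is hard-coded here for illustration, and the host names
       are hypothetical. */
    #include <stdio.h>
    #include <string.h>
    #include <strings.h>

    static const char *allowed_hosts[] = {
        "www.example.edu",
        "archive.example.org",
        NULL
    };

    /* Return 1 if host may be spidered, 0 otherwise. */
    static int host_allowed(const char *host)
    {
        int i;
        for (i = 0; allowed_hosts[i] != NULL; i++)
            if (strcasecmp(host, allowed_hosts[i]) == 0)
                return 1;
        return 0;
    }

    int main(void)
    {
        printf("%d\n", host_allowed("WWW.EXAMPLE.EDU"));   /* 1 */
        printf("%d\n", host_allowed("other.example.net")); /* 0 */
        return 0;
    }

The spider would run the check on every extracted link before queueing
it, which keeps the multi-machine collection in and everything else out.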

Bryan
At 09:39 AM 9/25/00 -0700, David Norris wrote:
>jmruiz@boe.es wrote:
>> I have noticed that this option is slow. I am wondering why.
>
>Could that be related to the waiting period between requests?  Some
>sites will automatically block IPs making consecutive requests very
>quickly as a denial of service protection.  There should be some way to
>change the delay in case you are indexing your own server.  For sites
>you don't manage there is an informal agreement among robot writers to
>keep requests below a certain rate.  Robot Guidelines 
><http://info.webcrawler.com/mak/projects/robots/guidelines.html>
>
>> I am wondering if there is a way to avoid the use of  swishspider.
>
>I have looked at a few options in the past.  GNU WGet may be a good
>option.  WGet is a very flexible mirroring robot written in C which
>supports HTTP and FTP.  I attempted to interface WGet in a manner
>compatible with swishspider.  I had trouble extracting URLs from the
>downloaded file with WGet.  WGet doesn't dump a list of URLs although it
>stores one in an internal structure.  I think it would be easier to
>interface WGet into SWISH-E in a manner different from swishspider.
>
>I should have a copy of Mark Gaulin's swishspider written in C.  I don't
>believe he ever publicly released the source.  You might want to email
>him and ask about it, if possible (I'm not sure he's still on this
>list).  If I can find the source I will send you a copy.  It builds and
>runs well on Windows.  It needs some porting to Unix.
>
>-- 
>,David Norris
>  Dave's Web - http://www.webaugur.com/dave/
>  Dave's Weather - http://www.webaugur.com/dave/wx
>  ICQ Universal Internet Number - 412039
>  E-Mail - dave@webaugur.com
>
>"I would never belong to a club that would have me as a member!"
>                                          - Groucho Marx
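
P.S. On the WGet link-extraction problem David mentions: a naive
extractor over the saved file is not much code. Something like this
untested sketch (it only matches lowercase href attributes that fit on
one line, with no entity decoding or relative-URL resolution) could
bridge the gap until WGet exposes its internal URL list:

    /* links.c -- print href="..." targets from a saved HTML file,
       one per line. Deliberately naive: lowercase href only, no
       multi-line tags, no URL resolution. An illustration, not a
       drop-in swishspider replacement. */
    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>

    int main(int argc, char **argv)
    {
        FILE *fp;
        char line[8192];

        if (argc != 2) {
            fprintf(stderr, "usage: %s file.html\n", argv[0]);
            return 1;
        }
        fp = fopen(argv[1], "r");
        if (fp == NULL) {
            perror(argv[1]);
            return 1;
        }
        while (fgets(line, sizeof line, fp)) {
            char *p = line;
            while ((p = strstr(p, "href")) != NULL) {
                char quote, *end;
                p += 4;
                while (isspace((unsigned char)*p)) p++;
                if (*p != '=') continue;
                p++;
                while (isspace((unsigned char)*p)) p++;
                if (*p != '"' && *p != '\'') continue;
                quote = *p++;
                end = strchr(p, quote);
                if (end == NULL) break; /* unterminated; skip line */
                printf("%.*s\n", (int)(end - p), p);
                p = end + 1;
            }
        }
        fclose(fp);
        return 0;
    }
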
Received on Tue Sep 26 02:21:01 2000