Re: Sorting by swishlastmodified...

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sat Apr 07 2001 - 16:17:38 GMT

At 09:59 AM 04/06/01 -0700, David Wood wrote:
>Bill, I did see your note about the new "prog" stuff and I'm certainly 
>interested, but that new spider is more complex than the previous one, and 
>we have some somewhat weird customisations to the previous one, and I just 
>haven't had the chance to play around with the new one enough yet.

Well, that's the kind of feedback I'm looking for.  I didn't think it was
more complex, but it is much more powerful since you can use Perl to control
the spider in your config file, which might make your "weird" customizations
easier.

Here's an example config file.  Any feedback you can provide to make it
easier would be great!

102) ~/swish-e/prog-bin/test %cat SwishSpiderConfig.pl
@servers = (
    {
        base_url    => 'http://sunsite.berkeley.edu:4444/',
        email       => 'moseley@hank.org',
        delay_min   => .0001,     # Delay in minutes between requests
        test_response => 
           sub { $_[0]->header('content-type') =~ m!^text/html! },
    },
);    

1;

That's it.  Now, run the indexing.

103) ~/swish-e/src/swish-e -S prog -i ../spider.pl -v 1
Indexing Data Source: "External-Program"
Indexing ../spider.pl..
../spider.pl: Reading parameters from 'SwishSpiderConfig.pl'
../spider.pl: http://sunsite.berkeley.edu:4444/
       URLs : 138  (19.7/sec)
   Spidered : 124
    Indexed : 124
  Duplicats : 1535
    Skipped : 1535
   MD5 Dups : 0

Removing very common words...
no words removed.
Writing main index...
Writing header ...
Writing index entries ...
Sorting Words alphabetically
Writing stopwords ...
2100 unique words indexed.
Writing file index...
Writing file list ...
DBG: Starting sorting of properties
DBG: End sorting of properties
Writing file offsets ...
Writing MetaNames ...
Writing Location lookup tables ...
Writing offsets (2)...
124 files indexed.
Running time: 7 seconds.
Indexing done!



>On the other hand, would the patch below fix the 'old' spider?  The idea is 
>that if you get HTTP code 200 back in swishspider then you write the 
>Last-Modified date into the .response file as well, and write it in seconds 
>since epoch format to save the C code having to muck around with date 
>formats, localisation, etc.

> >     print RESP str2time($response->header("last-modified")) . "\n";

I'll look at the code soon.  Of course, not all documents return a
last-modified date, so the patch would need to check for that header before
writing it.
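
A minimal sketch of that check, under some assumptions: str2time is taken
from HTTP::Date here (the patch does not say which module it uses), the
helper name last_modified_line is hypothetical, and writing an empty line
when the header is missing or unparseable is just one possible convention
that the C code reading the .response file would have to agree with.

```perl
use strict;
use warnings;
use HTTP::Date qw(str2time);   # assumption: the patch's str2time comes from here

# Hypothetical helper: build the line to write into the .response file.
# Returns the Last-Modified time in seconds since the epoch, or an empty
# line when the header is absent or cannot be parsed.
sub last_modified_line {
    my ($header_value) = @_;
    my $epoch = defined $header_value ? str2time($header_value) : undef;
    return ( defined $epoch ? $epoch : '' ) . "\n";
}

# In swishspider this might replace the unconditional print:
#   print RESP last_modified_line( $response->header("last-modified") );
print last_modified_line('Sat, 07 Apr 2001 16:17:38 GMT');
print last_modified_line(undef);
```

Always emitting a line, even an empty one, keeps the .response file at a
fixed number of lines, so the reader never has to guess whether a line is
missing.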



Bill Moseley
mailto:moseley@hank.org
Received on Sat Apr 7 16:18:49 2001