Re: [swish-e] partial indexing

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Sat Mar 28 2009 - 14:22:19 GMT
Zhou Xiang wrote on 3/27/09 3:12 PM:
> Thank you for your help!
> It is still strange, though. I tried to index the following page:
> http://digital.lib.lehigh.edu/beyondsteel_test/admin/templist.htm
> Although I set max_depth to 1, the spider still does not follow any of the links.
> That means it indexes only the text that appears on the above page, and none of
> the content behind each link.
> Can you figure it out?
> 
> My spider.config file:
> @servers = (
> {
>   base_url    => 'http://digital.lib.lehigh.edu/beyondsteel_test/admin/templist.htm',
>   email       => 'abc@gmail.com',
> 
>   # other spider settings described below
>   max_depth   => 1,
> },
> );

Did you read what I said before?

>>
>> Read the docs:
>>
>>  http://swish-e.org/docs/spider.html#configuration_options
>>
>> the default behaviour is to remain only on the same host.

All the links on the url you supply point at:

 rust.cc.lehigh.edu

which is not the same as

 digital.lib.lehigh.edu

so the spider stops because it will not leave the host you point it at. That's a
feature.

Why not pass in a list of all the urls you want spidered directly?

 base_url => [qw(
   http://rust.cc.lehigh.edu/beyondsteel/display.php?args=start-x_id-Sholes143line2
   http://rust.cc.lehigh.edu/beyondsteel/display.php?args=start-x_id-Sholes143line3
   http://rust.cc.lehigh.edu/beyondsteel/display.php?args=start-x_id-Sholes143line4
 )],

etc.
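
Put together with the rest of your config, that might look roughly like the
sketch below. I have not run this against your site; the email and max_depth
values are simply carried over from your file, and the three URLs are the ones
from the list above, so add the rest of yours in the same way.

```perl
# spider.config -- an untested sketch, not a drop-in file
@servers = (
    {
        # Start on the host that actually serves the pages, so the
        # same-host rule no longer stops the spider at the index page.
        base_url => [qw(
            http://rust.cc.lehigh.edu/beyondsteel/display.php?args=start-x_id-Sholes143line2
            http://rust.cc.lehigh.edu/beyondsteel/display.php?args=start-x_id-Sholes143line3
            http://rust.cc.lehigh.edu/beyondsteel/display.php?args=start-x_id-Sholes143line4
        )],
        email     => 'abc@gmail.com',

        # Carried over from your config; it now counts links followed
        # away from each URL listed above, still on the same host.
        max_depth => 1,
    },
);
```

Since every entry in the list is on rust.cc.lehigh.edu, the spider never has to
leave the host it starts on.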


-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Sat Mar 28 10:22:26 2009