
Re: Focused Spidering - Multiple Hosts

From: Gregory J. L. Tourte <g.tourte(at)not-real.ukoln.ac.uk>
Date: Wed Mar 01 2006 - 17:54:23 GMT
There used to be a project in the UK called WSE (Web Search Environments), which was used to enhance
record searching in academic search services. It used to be found at wse.search.ac.uk, but the page has
been removed; you can still find some notes at http://wse.search.ac.uk/demo.html.

The basic idea was to spider a finite number of links down from the main page and index those pages
slightly differently, so you could tell whether a result came from the original site or from a link
further down.
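Roughly, something like the sketch below is what I mean; this is only an illustration of the idea
(the seed URL is made up), not WSE's actual code, which I have never seen:

#!/usr/bin/perl
# Sketch only: breadth-first crawl that records each page's hop count
# from the seed, so the indexer can tell the original site (depth 0)
# apart from pages one or two links away.
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

my $max_depth = 2;    # relevancy reportedly drops off past two hops
my $ua = LWP::UserAgent->new( agent => 'depth-spider/0.1' );
my %seen;
my @queue = ( [ 'http://www.example.ac.uk/', 0 ] );  # [ url, depth ] -- made-up seed

while ( my $item = shift @queue ) {
    my ( $url, $depth ) = @$item;
    next if $seen{$url}++;

    my $res = $ua->get($url);
    next unless $res->is_success;
    next unless $res->content_type eq 'text/html';

    # Hand the page to the indexer along with its depth as metadata.
    print "indexing depth=$depth url=$url\n";

    next if $depth >= $max_depth;    # don't follow links any deeper

    # Collect links; passing the base URL resolves relative hrefs.
    my $extor = HTML::LinkExtor->new( undef, $url );
    $extor->parse( $res->decoded_content );
    for my $link ( $extor->links ) {
        my ( $tag, %attr ) = @$link;
        next unless $tag eq 'a' and $attr{href};
        push @queue, [ URI->new( $attr{href} )->canonical->as_string, $depth + 1 ];
    }
}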

ILRT in Bristol was leading the project, I think; you could always get in touch with them.

Adding this capability to swish-e would be a very interesting feature, I have to say, provided you
can limit the number of levels down which the crawler can go. If I remember the report correctly,
they found that relevancy dropped rather drastically after 2 hops.
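For what it's worth, here is a rough sketch of how such a crawl might be configured, following
spider.pl's @servers config format and its documented test_url callback. The max_depth parameter is
the capability I am describing and may not exist in your version, so treat it, and the URLs and
email address, as hypothetical:

# SwishSpiderConfig.pl -- sketch only, not tested.
# @servers and test_url follow spider.pl's config conventions;
# max_depth is assumed/hypothetical -- it is the feature under discussion.
my %server = (
    base_url  => 'http://www.example.ac.uk/',   # made-up seed URL
    email     => 'spider@example.ac.uk',        # made-up contact address
    agent     => 'swish-e spider',
    delay_sec => 5,
    max_depth => 2,    # stop two hops from the seed (assumed option)

    # Let the crawl leave the seed host, but only into .ac.uk;
    # an unrestricted cross-host crawl would never finish.
    test_url => sub {
        my $uri = shift;    # a URI object
        return $uri->host =~ /\.ac\.uk$/;
    },
);

@servers = ( \%server );
1;

As Bill notes below, spider.pl as shipped does not follow links off the base_url's host, so a small
patch would still be needed for the cross-site part; a callback like the one above would then keep
the crawl from wandering everywhere.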

Hope this helps

	Greg Tourte

Bill Moseley wrote:
> On Tue, Feb 28, 2006 at 03:48:19AM -0800, Shay Lawless wrote:
>> Having trawled through the multiple indexer / crawler / spider technologies
>> out there, the fact that swish-e indexes web pages as well as supporting
>> searching by meta tags etc. makes it a pretty good match to what I require.
>> However, having read the swish-e documentation I see that the spider.pl is
>> not designed to spider across offsite links or multiple hosts. I realise
>> that by adding to the @servers array it is possible to spider multiple
>> websites; however, in my case the sites that need to be crawled will only
>> be discovered as the crawl progresses.
> 
> Are you talking about an intranet or the Internet?
> 
> I suspect if you plan on finishing your PhD in the next decade or so
> you might need to find a faster way to spider.   It would be trivial
> to make the spider ignore the host name, but it would be too slow
> running a single process on a single machine to ever finish.  You
> would likely need hundreds, if not thousands, of machines widely
> distributed to spider the entire Internet.
> 
> Could you use an existing index, such as Google, to find the
> documents you want indexed?
> 

-- 

Greg Tourte                                  Systems Administrator/Developer
UKOLN
University of Bath                           tel: +44 (0)1225 384709
Bath, BA2 7AY, UK                            fax: +44 (0)1225 386838
http://www.ukoln.ac.uk/