Focused Spidering - Multiple Hosts

From: Shay Lawless <seamuslawless(at)>
Date: Tue Feb 28 2006 - 11:49:18 GMT
Hi All,

I am a newcomer to the list. I searched the archive for an answer to my
query before posting, without success, so apologies if this is something
that has already been discussed and resolved.

I am a PhD student in Trinity College Dublin. My research involves the use
of open corpus content in elearning, i.e. using freely available learning
content, from both the www and digital libraries, to generate online
learning offerings / courses, personalised to individuals' needs. As part of
this I need to implement a focused web crawler / spider to create an index
of sourced learning content on the www, that can then be searched. This is
where swish-e comes in!

Having trawled through the many indexer / crawler / spider technologies
out there, the fact that swish-e indexes web pages and also supports
searching by meta tags etc. makes it a pretty good match for what I require.
However, having read the swish-e documentation, I see that the spider is
not designed to follow offsite links or crawl across multiple hosts. I
realise that it is possible to spider multiple websites by adding to the
@servers array. In my case, however, the sites to be crawled will only be
discovered as the crawl progresses.

Has anyone out there configured swish-e to perform a focused web crawl
without all the host names being provided up front? Is this even possible
with swish-e's functionality?
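
To make the question concrete, the crawl behaviour described above could be
sketched roughly as follows. This is a minimal illustration in Python, not
swish-e code and not a description of how its spider works. The names
focused_crawl, fetch and is_relevant are hypothetical: fetch is assumed to
return a page's text and its links, and is_relevant is assumed to decide
whether a page is on topic.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def focused_crawl(seeds, fetch, is_relevant, max_pages=100):
    """Breadth-first focused crawl that discovers hosts as it goes.

    seeds       -- starting URLs (the only hosts known up front)
    fetch       -- hypothetical callable: url -> (text, list of hrefs)
    is_relevant -- hypothetical callable: text -> bool
    Returns (relevant page URLs, hosts in order of discovery).
    """
    seeds = list(seeds)
    frontier = deque(seeds)
    queued = set(seeds)      # every URL ever placed on the frontier
    pages, hosts = [], []
    fetched = 0

    while frontier and fetched < max_pages:
        url = frontier.popleft()
        fetched += 1
        text, hrefs = fetch(url)

        # Off-topic pages are neither kept nor used as a source of links.
        if not is_relevant(text):
            continue

        pages.append(url)
        host = urlparse(url).netloc
        if host not in hosts:
            hosts.append(host)

        # Links are followed wherever they point, including new hosts.
        for href in hrefs:
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).scheme in ("http", "https") and link not in queued:
                queued.add(link)
                frontier.append(link)

    return pages, hosts


if __name__ == "__main__":
    # Tiny in-memory stand-in for the web, so the sketch runs offline.
    web = {
        "http://a.example/": ("learning notes", ["/intro", "http://b.example/"]),
        "http://a.example/intro": ("learning intro", []),
        "http://b.example/": ("learning resources", []),
    }
    found = focused_crawl(
        ["http://a.example/"],
        lambda u: web.get(u, ("", [])),
        lambda t: "learning" in t,
    )
    print(found)
```

Only the seed host is supplied at the start; any further host enters the
list solely because a relevant page linked to it. The sketch says nothing
about how the resulting pages would be handed to swish-e for indexing.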

Any help you can provide will be greatly appreciated, thanks in advance,


Received on Tue Feb 28 03:49:22 2006