I recently discovered the GNU wget utility. It seems very
robust and able to crawl a remote web site in just about any
way you could think of.
Given its existence and my general loathing of reinventing the
wheel, it seems fairly easy to make SWISH++ index remote web
sites using it by providing a simple "glue" script wget2index:
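#! /usr/local/bin/perl
# Print the local file name from each line of "wget -nv" output,
# i.e. the quoted part of: ... -> "www.other-site.com/index.html" ...
while ( <> ) {
    print "$1\n" if /-> "([^"]+)"/;
}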
Given that, you can now do:
wget -rxnv -linf -A txt,html -X/cgi-bin \
http://www.other-site.com 2>&1 | wget2index | index -
to copy a remote site to a local filesystem that 'index' can
index. Your Perl CGI script that calls search would have to
know to take the first directory name of each indexed path and
make that the hostname.
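
As a rough illustration of that hostname mapping, here is a minimal
sketch. The helper name path_to_url is made up for this example. It
assumes the paths stored in the index look like
www.other-site.com/dir/page.html (the layout wget -x creates) and
that the remote site is served over plain http.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of SWISH++): turn a path as stored
# in the index into a URL on the remote site.
# ASSUMPTIONS: the first path component is the hostname, and the
# site uses plain http.
sub path_to_url {
    my ($path) = @_;
    $path =~ s{^\./}{};                    # tolerate a leading "./"
    my ($host, $rest) = split m{/}, $path, 2;
    return unless defined $host && length $host;
    $rest = '' unless defined $rest;       # bare hostname => site root
    return "http://$host/$rest";
}

print path_to_url('www.other-site.com/docs/a.html'), "\n";
```

The CGI script would call something like this on each path that
search returns before printing it as a link.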
If local filesystem space is an issue, i.e., you don't want to
copy an entire remote web site to your local filesystem as you
index it, I'm sure it would be possible to write a slightly
more complicated Perl script that deletes each file after it
is indexed as the get/index cycle progresses. You'd probably
end up using the IPC::Open2 Perl module (see the Perl 5
"Camel" book, p. 344): open a bidirectional pipe to index with
the -v3 option so the script can tell when a file has been
indexed and can therefore be deleted safely.
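
The delete-as-you-go idea might look roughly like the sketch below.
It is untested against the real index program. The function name
index_and_delete is made up, and the sketch ASSUMES that the indexer,
once it has finished a file, prints a complete line that contains
that file's path, and that it flushes that line right away. If
either assumption is wrong for index -v3, the matching logic would
have to change.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IPC::Open2;

# Read paths from $in, hand each one to the indexer command, wait
# for the indexer to print a line mentioning that path, then delete
# the file. Returns the list of paths actually deleted.
sub index_and_delete {
    my ($in, @indexer_cmd) = @_;
    my ($from_indexer, $to_indexer);
    my $pid = open2($from_indexer, $to_indexer, @indexer_cmd);

    # Unbuffer our side so the indexer sees each path immediately.
    my $old = select($to_indexer); $| = 1; select($old);

    my @deleted;
    while (defined(my $path = <$in>)) {
        chomp $path;
        next unless length $path;
        print {$to_indexer} "$path\n";
        while (defined(my $line = <$from_indexer>)) {
            # ASSUMED output format: the line names the finished file.
            next unless index($line, $path) >= 0;
            push @deleted, $path if unlink $path;
            last;
        }
    }
    close $to_indexer;                     # lets the indexer finish up
    while (defined(my $rest = <$from_indexer>)) { }   # drain output
    waitpid $pid, 0;
    return @deleted;
}

# Runs only when an indexer command is given as arguments.
index_and_delete(\*STDIN, @ARGV) if @ARGV;
```

Saved under some name of your choosing, it would replace the final
"| index -" stage of the pipeline, with "index -v3 -" passed as its
arguments. Note that if the indexer buffers its output when writing
to a pipe, the inner read loop would block, which is the usual
hazard with open2.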
- Paul
Received on Mon Dec 28 16:04:44 1998