Re: using config.pl

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Apr 15 2004 - 20:07:47 GMT
On Thu, Apr 15, 2004 at 11:50:00AM -0700, Lung.Allen wrote:
> 
> File1.conf
> IndexDir /app/swish/lib/swish-e/spider.pl
> SwishProgParameters default http://10.20.172.100/doc/redhat-config-bind-2.0.0/
> IndexFile /var/www/index.file1
> ParserWarnLevel 3
> FileFilter .pdf pdf2html "'%p' -"

Since you are using spider.pl with "default" as the first parameter, it
should do that PDF conversion automatically, so you don't need the
FileFilter line.

Take a look at the default_urls() function in spider.pl.  It loads
SWISH::Filter and passes documents to that module.  SWISH::Filter looks
for pdftotext and, if it is found, filters any document that has a
content type of application/pdf.

Oh, also use HTML* everywhere -- including your DefaultContents line.

> My next step is to use swishspider.conf like this: 'swish-e
> -S prog -c swishspider.conf' The contents follow:

> I then created a config.pl, the contents follow:
 
>         my %serverA = (
>                 base_url        => 'http://10.201.12.64/',
>                 email           => 'allen.lung@ftb.ca.gov',
>     debug           => DEBUG_URL | DEBUG_FAILED | DEBUG_SKIPPED,
> #               link_tags       => [qw/ a frame /],
> #               test_url        => \&foo,
>         );
>         my %serverB = (
>                 base_url        => 'http://10.20.172.100/doc/redhat-config-bind-2.0.0/',
>                 email           => 'allen.lung@ftb.ca.gov',
> #               link_tags       => [qw/ a frame /],
> #               test_url        => \&foo,
>         );
> @servers = ( \%serverA, \%serverB, );
> 
> #               test_url        => sub {
> #                       my $uri->path =~ /\. (gif|jpeg|png|doc|pdf)$/;
> #                       return 1;
> #               },
> 
> Is this the proper way to use the config.pl?

Well, kind of.  You have your test_url commented out, so it's going to
try to index everything it can find.

The idea is to use test_url to test by file name.  The test_url callback
is called for each URL extracted from a page's links, before the
document is fetched, so it's a good place to exclude obvious files
based on their names.  You can use the "test_response" callback to test
the content type returned when fetching the document.  Currently that
callback is called when the first chunk of data is returned from the
server.  I may change that to do a HEAD request first, because aborting
a GET request breaks the keep-alive connection.
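
As a rough sketch of how the two callbacks could fit into your config.pl
(a config fragment that spider.pl loads, not a standalone script): the
argument order shown here ($uri first for test_url; $uri, $server,
$response for test_response) is how I recall the SwishSpiderConfig.pl
example, so check it against your copy.  The list of accepted content
types is only an assumption for illustration.

```perl
my %serverB = (
    base_url => 'http://10.20.172.100/doc/redhat-config-bind-2.0.0/',
    email    => 'allen.lung@ftb.ca.gov',

    # Called for each link found on a page, before anything is fetched.
    # Return false to skip the URL.  Only images are skipped here, so
    # .pdf and .doc links still get through.
    test_url => sub {
        my ($uri) = @_;
        return $uri->path !~ /\.(?:gif|jpe?g|png)$/i;
    },

    # Called once the server has started responding.  $response is an
    # HTTP::Response object; return false to abort the fetch.
    # The accepted types below are an assumed example list.
    test_response => sub {
        my ( $uri, $server, $response ) = @_;
        my $type = $response->content_type;
        return scalar grep { $type eq $_ }
            qw( text/html text/plain application/pdf application/msword );
    },
);

@servers = ( \%serverB );
```

Note the regular expression has no space after the escaped dot, and the
callback returns the result of the match instead of always returning 1.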

> This is actually attempting to index .pdf and .doc files I do want to
> index .pdf, .doc and many others.  The first files I want to index
> beyond what I'm doing now is .pdf!  I hope I'm making sense here.  I
> started this process with the code that has the #.  Is this the proper
> location to do the callback subroutines?

Yes, that's right.  You can look at spider.pl to see how to use
SWISH::Filter.  The second example in the SwishSpiderConfig.pl example
file also shows how to filter content.

The idea there is that you pass the document and its content type to
the SWISH::Filter module.  It tries to find the filters needed for that
document, converts it, and returns a new document and a new content type
(like text/html from PDF) that swish-e can index.

Take a look at SwishSpiderConfig.pl and see if it makes sense.  Post
back if you have questions.



-- 
Bill Moseley
moseley@hank.org
Received on Thu Apr 15 13:07:47 2004