
error indexing pdf files

From: Jody Cleveland <Cleveland(at)not-real.mail.winnefox.org>
Date: Tue Apr 15 2003 - 19:03:18 GMT
> By the way, 
> 
>   $URI::ABS_REMOTE_LEADING_DOTS = 1;
> 
> at the top of my spider config file does seem to fix it.  

I put that at the top, and I still get some .. URLs, but it doesn't look
like I get as many. I ran the site through the W3C link checker, and there
are broken links all over the place on that site.
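
For anyone else hitting the .. problem, here is roughly where I put that
line; the comment is just my understanding of what the variable does, so
take it as a sketch:

# Top of SwishSpider.pl, before @servers:
use URI;

# My understanding: this tells URI.pm to collapse leading ".." segments
# when resolving relative links, instead of leaving them in the
# absolute URL it builds.
$URI::ABS_REMOTE_LEADING_DOTS = 1;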

I spoke with the person in charge of the site, and she wants a separate
search page for each directory. So it looks like I'll need a separate swish
index for a few of the directories. Also, she didn't seem willing to change
all the pages to have a meta tag, so it looks like swish-e will have to do
all the work. A rough sketch of how I might handle the per-directory
indexes is below.
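
This is only a sketch, not something I've run yet: the SPIDER_DIR
environment variable and its default are my own invention, and the pdf/doc
filter subs are the same ones used in my config below. The idea is one
spider config reused per directory, with a separate swish-e run for each
directory writing to its own index file (if I remember right, the -f
switch picks the index file name):

my $dir = $ENV{SPIDER_DIR} || '/citydirs/';   # e.g. SPIDER_DIR=/otherdir/

@servers = (
    {
        base_url   => "http://www.oshkoshpubliclibrary.org$dir",
        email      => 'swish@domain.invalid',
        link_tags  => [qw/ a frame /],
        keep_alive => 1,

        # Only follow URLs whose path stays under the chosen directory.
        test_url   => sub {
            my $uri = shift;
            return if $uri->path =~ /\.(gif|jpeg)$/;
            return $uri->path =~ m[^\Q$dir\E];
        },

        filter_content => [ \&pdf, \&doc ],   # same pdf/doc subs as below
    },
);

1;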

Here's the main chunk of my SwishSpider.pl:
@servers = (
    {

        skip        => 0,  # skip spidering this server
        debug       => DEBUG_URL,  # print some debugging info to STDERR

        base_url        => 'http://www.oshkoshpubliclibrary.org/citydirs/',
        email           => 'swish@domain.invalid',
        delay_min       => .0001,
        link_tags       => [qw/ a frame /],
        max_files       => 2500,
        max_indexed     => 2500,        # Max number of files to send to swish for indexing

        max_size        => 100_000_000_000,  # max file size in bytes (set very high, effectively no limit)
        max_depth       => 10,         # spider only ten levels deep
        keep_alive      => 1,

        test_url        => sub {
            my $uri = shift;
            return if $uri->path =~ /\.(gif|jpeg)$/;
            return $uri->path =~ m[^/citydirs/];
        },

        test_response   => sub {
            my $content_type = $_[2]->content_type;
            my $ok = grep { $_ eq $content_type }
                qw{ text/html text/plain application/pdf application/msword };

            # This might be used if you only wanted to index PDF files,
            # yet still let the spider follow links through other pages.
            #$_[1]->{no_index} = $content_type ne 'application/pdf';

            return 1 if $ok;
            print STDERR "$_[0] wrong content type ( $content_type )\n";
            return;
        },

        filter_content  => [ \&pdf, \&doc ],
    },

As it is now, it indexes all of www.oshkoshpubliclibrary.org. I've got
/citydirs/ in the test_url part. For the base_url, should that stop at
.org, or should it also contain /citydirs/? And am I missing something
else, or do I have a typo? Either way I try it, it still indexes
everything at the site.
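
To see what test_url is actually being handed, I may try a version with
some extra logging; this is only a sketch (the STDERR lines are mine, the
/citydirs/ check is the same as above), so I can watch which URLs pass the
test and which slip through:

        test_url        => sub {
            my $uri = shift;
            return if $uri->path =~ /\.(gif|jpeg)$/;

            # Same /citydirs/ check as before, plus a line to STDERR for
            # every URL so I can see which ones get past the test.
            my $ok  = $uri->path =~ m[^/citydirs/];
            my $tag = $ok ? "FOLLOW" : "SKIP  ";
            print STDERR "$tag $uri\n";
            return $ok;
        },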

Jody
Received on Tue Apr 15 19:07:05 2003