Getting the right files indexed the right way

From: Rob de Santos AFANA <rdesantos(at)not-real.afana.com>
Date: Wed Jan 28 2004 - 07:31:22 GMT
Hi, 

Now that I have swish-e installed and mostly working, I have run into a
slight problem, more than likely due to my own lack of understanding.  I
want swish-e to index most of the files at my site, using the filters
where necessary, but for certain types of files, namely graphics (e.g.
.gif, .jpg), to index only the file path/name.  My config files are set
up this way (edited for brevity):

in my swish-e config file:
-----
IndexDir spider.pl

NoContents .gif .jpg .png .cgi .pl .log .jar .ico .js .class .sql .csv .dir .idx .dat
IndexContents HTML* .htm .html .shtm .shtml .css
IndexContents TXT* .txt .text
IndexContents XML* .xml .wml .rdf .rss
DefaultContents HTML

SwishProgParameters /home/afana/public_html/swish-e/lib/swish-e/SwishSpiderConfig.pl http://www.afana.com http://www.afana.com/blog/archives/ http://www.afana.com/album/ http://www.afana.com/webbbs/bbs1/ http://www.afana.com/webbbs/bbs0
-----

in my spider.pl config file I have this:
-----
        filter_content  => \&filter_content,        

sub filter_content {
    my ( $uri, $server, $response, $content_ref ) = @_;

    # Uncomment this to enable debugging of SWISH::Filter
    # $ENV{FILTER_DEBUG} = 1;

    my $content_type = $response->content_type;
    my $uri_ext = $uri->path;
    
    # Skip filtering for text/* content and for image files (by extension)
    return 1 if !$content_type
        || $content_type =~ m!^text/!
        || $uri_ext =~ /\.(gif|jpg|jpeg|png)$/;
    
    # Load the module - returns FALSE if cannot load module.
    unless ( $filter ) {
        eval { require SWISH::Filter };
        if ( $@ ) {
            $server->{abort} = $@;
            return;
        }
        $filter = SWISH::Filter->new;
        unless ( $filter ) {
            $server->{abort} = "Failed to create filter object";
            return;
        }
    }

    # If not filtered return false and doc will be ignored (not indexed)
    
    my $doc = $filter->convert(
        document => $content_ref,
        name     => $response->base,
        content_type => $content_type,
    );
    return unless $doc;
    # return unless $doc->was_filtered
    # (could do this since checking for text/* above)
    return if $doc->is_binary;

    $$content_ref = ${$doc->fetch_doc};

    # let's see if we can set the parser.
    $server->{parser_type} = $doc->swish_parser_type || '';

    return 1;
}
-----
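
To see which documents take the early "return 1" path in filter_content, the test can be exercised on its own. This is only a minimal sketch: the helper name skips_filter and the sample paths are hypothetical, the helper takes a plain path string where the spider would supply $uri->path, and the extension group is written without the trailing "?" on the assumption that this was the intent.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper that mirrors the early-return test in filter_content.
# Returns 1 when the document would be returned without calling SWISH::Filter.
sub skips_filter {
    my ( $path, $content_type ) = @_;

    return 1 if !$content_type;                       # no content type reported
    return 1 if $content_type =~ m!^text/!;           # text/* needs no filtering
    return 1 if $path =~ /\.(gif|jpg|jpeg|png)$/;     # image by extension (case-sensitive)
    return 0;
}

# Made-up sample documents: [ path, content type ]
my @cases = (
    [ '/album/photo.jpg',  'image/jpeg' ],
    [ '/album/PHOTO.JPG',  'image/jpeg' ],
    [ '/index.html',       'text/html' ],
    [ '/docs/report.pdf',  'application/pdf' ],
);

for my $case (@cases) {
    my ( $path, $type ) = @$case;
    my $verdict = skips_filter( $path, $type )
        ? 'returned as-is'
        : 'handed to SWISH::Filter';
    printf "%-20s %-16s %s\n", $path, $type, $verdict;
}
```

With these samples, the lowercase .jpg and the text/html page are returned as-is, while the uppercase .JPG and the PDF are handed to SWISH::Filter, because the extension match is case-sensitive.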

When the indexing runs, swish-e attempts to read and interpret the jpeg
files rather than simply adding the file path and name to the index as
indicated in the NoContents directive.

So what am I doing wrong?  (Or at least... where do I start? :-) )

Regards, 

-Rob de Santos
-Columbus, Ohio USA
Chairman of the Board,
Australian Football Association of North America (AFANA)
ph: 1-888-4AFANA1 (North America) (1-888-423-2621)
ph: 1-614-338-0002 (outside NA)  
e-mail: rdesantos(at)not-real.afana.com   web: <http://www.afana.com>
Received on Tue Jan 27 23:31:23 2004