Re: ignoring words inside form elements

From: Bill Moseley <moseley(at)>
Date: Thu Apr 12 2001 - 04:37:55 GMT
By the way,

If you do decide to parse and filter the HTML, here's some Perl code you
can plug into the example's configuration (
to strip out the <select> tags and their contents.  This is only available
in the development version of swish, of course.

If you are not spidering a web server, then you can use $tree->parse_file()
instead and use the example in the prog-bin directory for ideas.
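
For the local-file case, a rough sketch of the parse_file() variant might look
like the following. The helper name strip_select_from_file is made up for
illustration and is not part of swish or its examples; it only assumes
HTML::TreeBuilder is installed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not part of swish): return the HTML of a local
# file with every <select> element and its contents removed.
sub strip_select_from_file {
    my ($path) = @_;
    die "Cannot read $path\n" unless -r $path;

    my $tree = HTML::TreeBuilder->new;
    $tree->store_comments(1);     # keep comments in the tree
    $tree->parse_file($path);     # parse the whole file from disk

    # Remove each <select> element along with its children
    $_->delete for $tree->find_by_tag_name('select');

    my $html = $tree->as_HTML;
    $tree->delete;                # release the tree's memory
    return $html;
}
```

The returned string could then be handed to the indexer in whatever way the
prog-bin example you start from expects.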

I didn't run any long tests, but it did seem to add a bit of time to the
indexing (50% more?).  Parsing HTML isn't fast.  Maybe a regular expression
would be faster?  Of course, if you know what files have select tags then
you can just parse those.
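
On the regular expression idea, here is an untested-at-scale sketch of what
that might look like. The names strip_select and no_select_regex are made up
for illustration; the filter takes the same four arguments as the TreeBuilder
filter in this message. It assumes the select tags are properly closed and
that no attribute value contains a ">" character, so it is less robust than
a real parse.

```perl
use strict;
use warnings;

# Hypothetical helper: remove <select>...</select> blocks in place.
sub strip_select {
    my ($content_ref) = @_;

    # /i ignores tag case, /s lets "." match newlines, and the non-greedy
    # ".*?" stops at the nearest closing </select> tag.
    $$content_ref =~ s{<select\b[^>]*>.*?</select\s*>}{}gis;
    return 1;
}

# Hypothetical filter using the same argument list as the spider filters.
sub no_select_regex {
    my ( $uri, $server, $response, $content_ref ) = @_;

    # Only deal with HTML pages
    return 1 unless $response->content_type eq 'text/html';

    return strip_select($content_ref);
}
```

Whether this is actually faster than building the tree would need to be
measured on your own documents.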

Anyway, add this someplace in (found in prog-bin):

use HTML::TreeBuilder;

sub no_select {
    my ( $uri, $server, $response, $content_ref ) = @_;

    # Only deal with html pages
    return 1 unless $response->content_type eq 'text/html';

    my $tree = HTML::TreeBuilder->new;
    $tree->store_comments(1); # index comments?
    $tree->parse( $$content_ref );
    $tree->eof;               # tell the parser the document is complete

    # Remove every <select> element along with its contents
    $_->delete for $tree->find_by_tag_name('select');

    $$content_ref = $tree->as_HTML;
    $tree->delete;            # free the tree so memory doesn't pile up
    return 1;
}

Then in modify the parameters in the hash to do
something like this:

   filter_content  => [ \&pdf, \&doc, \&no_select ],

Which calls the three filters for each document.

Disclaimer: I didn't test very much...

Bill Moseley
Received on Thu Apr 12 04:39:47 2001