Re: [swish-e] problems with spidering UTF8

From: Michael Peters <mpeters(at)>
Date: Tue Mar 24 2009 - 17:47:22 GMT
Michael Peters wrote:

> Am I right? If so, how can I fix this? When I use swish-e to index a 
> filesystem with HTML docs that have UTF8 I use a FileFilter that changes 
> UTF8 chars into HTML entities. Can I do something similar with the spider?

The answer, for anyone else who comes after me, is "Yes!". It's the spider's
output_function option, which replaces the spider's normal printing with your
own function. It's not quite the same concept as a filter, since I had to copy
some of the existing spider code into my sub to make it work. I ended up with a
sub like this:

   use Encode qw(decode_utf8);

   sub filter_output {
       my ($server, $content, $uri, $response, $bytecount, $path) = @_;

       # Decode the UTF-8 bytes, then replace every non-ASCII character
       # with a hexadecimal HTML entity
       $$content = decode_utf8($$content);
       $$content =~ s/([^\p{IsASCII}])/sprintf('&#x%X;', ord($1))/ge;

       # The content is pure ASCII now, so length() is also the byte count
       my $new_length = length($$content);
       print "Path-Name: $path\nContent-Length: $new_length\n";
       print "Charset: $server->{charset}\n" if $server->{charset};
       print "Last-Mtime: " . $response->last_modified . "\n"
           if $response->last_modified;

       # Set the parser type if specified by filtering
       if ( my $type = delete $server->{parser_type} ) {
           print "Document-Type: $type\n";
       } elsif ( $response->content_type =~ m!^text/(html|xml|plain)! ) {
           $type = $1 eq 'plain' ? 'txt' : $1;
           print "Document-Type: $type*\n";
       }
       print "No-Contents: 1\n" if $server->{no_contents};

       # A blank line ends the headers; the document itself follows
       print "\n", $$content;
   }
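
For reference, a sub like this gets hooked in through the spider's config file. What follows is only a minimal sketch, assuming the usual SwishSpiderConfig.pl layout with a global @servers list; the URL and email address are placeholders, and the sub body is left out here:

```perl
# SwishSpiderConfig.pl (sketch; base_url and email are placeholder values)
use strict;
use warnings;
use Encode qw(decode_utf8);

our @servers = (
    {
        base_url        => 'http://www.example.com/',   # placeholder
        email           => 'admin@example.com',         # placeholder
        # Called in place of the spider's own header/content printing
        output_function => \&filter_output,
    },
);

sub filter_output {
    my ($server, $content, $uri, $response, $bytecount, $path) = @_;
    # body as in the sub posted in this message
}

1;
```

The spider is then typically run by swish-e itself with -S prog, with IndexDir pointing at spider.pl and SwishProgParameters naming this config file.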

That seems to do everything I want it to.
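
The conversion step can also be checked on its own, outside the spider. This is a small sketch; utf8_to_entities is a hypothetical helper name, and the character class is written as [^\x00-\x7F], which should match the same characters as [^\p{IsASCII}]:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical helper: takes UTF-8 encoded bytes and returns ASCII text
# with every non-ASCII character replaced by a hex HTML entity.
sub utf8_to_entities {
    my ($bytes) = @_;
    my $text = decode_utf8($bytes);
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%X;', ord($1))/ge;
    return $text;
}

# UTF-8 bytes for "cafe" with an e-acute, then a euro sign and "5"
my $in = "caf\xC3\xA9 \xE2\x82\xAC5";
print utf8_to_entities($in), "\n";    # prints: caf&#xE9; &#x20AC;5
```

Because the result is plain ASCII, length() on it gives the byte count that Content-Length needs.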

Michael Peters
Plus Three, LP

Received on Tue Mar 24 13:49:49 2009