Re: [swish-e] problems with spidering UTF8

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Mar 25 2009 - 06:16:11 GMT

On Tue, Mar 24, 2009 at 01:47:22PM -0400, Michael Peters wrote:
> Michael Peters wrote:
> 
> > Am I right? If so, how can I fix this? When I use swish-e to index a 
> > filesystem with HTML docs that have UTF8 I use a FileFilter that changes 
> > UTF8 chars into HTML entities. Can I do something similar with the spider?
> 
> The answer, for anyone else who comes after me, is "Yes!". It's called 
> output_function and it replaces the normal printing done by the spider with your 
> own function (not quite the same concept as a filter as it makes me copy some of 
> the existing spider code into my sub so it works right). I ended up with a sub 
> like this:
> 
>    use Encode qw(decode_utf8);
>    sub filter_output {
>        my ($server, $content, $uri, $response, $bytecount, $path) = @_;
>        $$content = decode_utf8($$content);
>        $$content =~ s/([^\p{IsASCII}])/sprintf('&#x%X;', ord($1))/ge;
>        my $new_length = length($$content);
>        print "Path-Name: $path\nContent-Length: $new_length\n";
>        print "Charset: $server->{charset}\n" if $server->{charset};
>        print "Last-Mtime: " . $response->last_modified . "\n"
>            if $response->last_modified;

I'm a bit confused by this.

As you know, there are no delimiters between "files" in the -S prog
data stream.  Instead, each document carries a Content-Length header
giving its size in bytes.  Swish-e uses that to know how many bytes to
read for that document -- that is, how many bytes until the next
document starts.

Again, that's bytes, not characters.

The problem is that if the Content-Length is reported in characters
rather than bytes, swish will read the wrong number of bytes.
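
To make the distinction concrete, here is a minimal sketch (the sample
string is made up, it is not from the spider) showing how the two
counts differ for UTF-8 text:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# "cafe" with an e-acute, as raw UTF-8 octets: the accented letter
# occupies two bytes (0xC3 0xA9).
my $octets = "caf\xC3\xA9";
my $chars  = decode_utf8($octets);

my $char_count = length $chars;                 # counts characters
my $byte_count = length encode_utf8($chars);    # counts octets

print "characters: $char_count\n";    # characters: 4
print "bytes:      $byte_count\n";    # bytes:      5
```

A Content-Length of 4 here would leave the last octet of the document
unread.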

One would expect that a length reported in characters would, if
anything, be too low.  But you are seeing:

    Warning: Unknown header line: 'th-Name:

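which looks like the Content-Length header was reporting too many
bytes: swish read past the end of the document and swallowed the
first two bytes ("Pa") of the next Path-Name header.  That's where
I'm a bit confused.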



The spider gets the content from LWP with:

        my $content = $response->decoded_content;

So $content is characters at that point.  That's what you want -- you
want characters in your Perl program and octets on the outside.

When spider.pl sends the data to swish it first re-encodes it
into bytes (octets).[1]  At that point spider.pl really should just use
length() to get the length in bytes (as the data is no longer characters).

spider.pl uses "$bytecount = length pack 'C0a*', $$content;", which I
think was to deal with different versions of perl, but really seems
like it should just be length() now.
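
Roughly what I have in mind is below.  This is only a sketch:
format_record() is a hypothetical helper, not something in spider.pl,
and it assumes the usual -S prog layout of header lines, a blank line,
then the content.  The point is the order of operations: encode
first, then take length() of the octets.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper, not part of spider.pl: build one record for the
# -S prog stream from a string of characters.
sub format_record {
    my ($path, $chars, $encoding) = @_;

    # Encode first, so that length() below counts octets.
    my $octets    = encode($encoding, $chars);
    my $bytecount = length $octets;

    return "Path-Name: $path\n"
         . "Content-Length: $bytecount\n"
         . "\n"
         . $octets;
}

# Made-up example document: four characters, five octets in UTF-8.
my $record = format_record('http://localhost/a.html', "caf\x{E9}", 'UTF-8');

binmode STDOUT;    # write the octets as-is, with no I/O layer re-encoding
print $record;
```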


[1] It re-encodes back to the original encoding because the content
might have a meta content-type header that includes a charset and the
charset and encoding should match, of course.


-- 
Bill Moseley.
moseley@hank.org
Sent from my iMutt
Received on Wed Mar 25 02:16:24 2009