Re: [swish-e] The old encoding/length problem with spider.pl

From: Matthew "Cheetah" Gabeler-Lee <cheetah-swishe(at)not-real.fastcat.org>
Date: Tue Sep 04 2007 - 16:08:11 GMT

On Sun, 2 Sep 2007, Bill Moseley wrote:

> On Wed, Aug 29, 2007 at 04:15:04PM -0400, Matthew Cheetah Gabeler-Lee wrote:
> > [ swish-e 2.4.5, debian linux package ]
> > 
> > I've recently run across a problem with text encodings and spider.pl 
> > that seems to have been resurfacing occasionally for a good 5-6 years, 
> > and I think I may have a suggestion to help.
> 
> I've been on vacation so haven't had time to look at this in detail.
> 
> The spider needs to decode the fetched content (LWP does this,
> actually), and then work with it as a perl string.  What would be easy
> is to then just encode to utf8 and send to swish-e (for libxml2 to
> parse).  And bytes::length() should give the length in bytes for swish
> to read in.
> 
> IIRC, the problem is that there might be a charset in a <meta> tag
> that would indicate to libxml2 that it was not utf8 encoding.  So, I'd
> need to look at that again.

Gotcha.  I figured something like that was at play with the re-encoding 
code I saw nearby.
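
The difference between character length and byte length that matters for 
the length header can be sketched as follows.  This is only an 
illustration: the sample string and variable names are made up, and it 
uses Encode::encode() plus plain length() rather than bytes::length().

```perl
use strict;
use warnings;
use Encode qw(encode);

# A decoded Perl character string (illustrative sample): 4 characters.
my $text = "caf\x{e9}";

# Encode explicitly to UTF-8 octets; the e-acute becomes two bytes.
my $octets = encode('UTF-8', $text);

my $char_len = length($text);     # counts characters
my $byte_len = length($octets);   # counts bytes, since $octets is a byte string

print "chars=$char_len bytes=$byte_len\n";   # prints "chars=4 bytes=5"
```

The byte count, not the character count, is what the reader on the other 
end of the pipe needs in order to know how much to read.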

> Ah, I'm not sure I looked at what the layer might be for STDOUT.  I
> guess I assumed that was not an issue with a pipe.  Again, something I
> need to look at in more detail.

I'd presume that pipes are not going to have an encoding layer unless 
one is explicitly assigned to them.  I think the issue is that if you 
want to output encoded text, you either have to assign an encoding layer 
to the pipe, or output the bytes version of the string.
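
A minimal sketch of those two options, assuming UTF-8 as the target 
encoding.  In-memory file handles stand in for the pipe to swish-e here 
so the results can be compared; the sample string is made up.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "na\x{ef}ve \x{263a}";   # decoded character string (sample data)

# Option 1: assign an encoding layer to the handle, then print characters.
open(my $layered, '>', \my $buf1) or die "open: $!";
binmode($layered, ':encoding(UTF-8)');
print {$layered} $text;
close($layered);

# Option 2: leave the handle without a layer and print encoded bytes.
open(my $raw, '>', \my $buf2) or die "open: $!";
binmode($raw);
print {$raw} encode('UTF-8', $text);
close($raw);

# Both routes should put identical bytes on the wire.
print $buf1 eq $buf2 ? "same bytes\n" : "different bytes\n";
```

With option 2 the byte string is already in hand, so its length can be 
taken directly before it is written.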

While I believe switching encoding layers on a file handle on the fly is 
supported, it would probably be simpler, given the whole re-encoding 
thing, to always output the encoded bytes.  If one did that, I think it 
wouldn't be necessary to fall back on pushing things into utf8 for 
transmission to libxml2.
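
Roughly, "always output the encoded bytes" could look like the sketch 
below.  build_record() is a hypothetical helper, not code from spider.pl; 
the URL and charset are placeholders, and the Path-Name and 
Content-Length header names are my recollection of the swish-e prog 
input format, so check them against the swish-e documentation.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build one record from decoded text, keeping the
# document's own charset instead of forcing UTF-8.
sub build_record {
    my ($path, $text, $charset) = @_;
    my $octets = encode($charset, $text);   # bytes in the target charset
    my $header = "Path-Name: $path\n"
               . 'Content-Length: ' . length($octets) . "\n\n";
    return $header . $octets;               # length counted on bytes
}

my $record = build_record('http://example.com/a.html', "caf\x{e9}", 'iso-8859-1');

binmode(STDOUT);        # no encoding layer: bytes pass through untouched
print STDOUT $record;
```

Because the length is computed from the same byte string that gets 
printed, the header and the body cannot disagree, whatever charset a 
<meta> tag in the document declares.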

> Can you set up any test cases I could try?

I believe this test file (containing every char from \x01 to \xFF) 
should work as a test case for the particular problem I hit:

http://fastcat.org/tmp/chars.txt
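
A file of that shape could be reproduced locally along these lines.  This 
assumes one raw byte per value, which may not match how the linked file 
is actually encoded, and the output file name is made up.

```perl
use strict;
use warnings;

# Every single-byte value from 0x01 to 0xFF, as one string of raw octets.
my $bytes = join '', map { chr } 1 .. 255;

# Write without any encoding layer so each value stays a single byte.
open(my $out, '>:raw', 'chars-local.txt') or die "open: $!";
print {$out} $bytes;
close($out) or die "close: $!";
```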

-- 
	-Cheetah
"Reality is that which, when you stop believing in it, doesn't go away".
                -- Philip K. Dick
GPG pubkey fingerprint: A57F B354 FD30 A502 795B 9637 3EF1 3F22 A85E 2AD1
Received on Tue Sep 4 12:08:16 2007