Re: Ignore Question

From: Bill Moseley <moseley(at)>
Date: Fri Feb 28 2003 - 17:15:07 GMT

On Fri, 28 Feb 2003, Gentile, Jeff wrote:

> --->use bytes;
>     my $size = length $txt;
> --->no bytes;

> Notice the two lines I pointed to; I looked into the "length" function, 
> and it's a "known" issue that even though it says it reports "bytes" it
> reports characters, however "use bytes" is supposed to fix that. I've got
> a pdf that is 787,244 bytes in size before conversion. The text file size
> of the output is 208,786. The content length (either way... not sure if
> "use bytes" works properly... still in experimental phase) returns as
> 208,573.

In other words, if you have a scalar and run

    $size = length $txt;

it reports 208,573, but if you write out the file the file system reports
it as 208,786 bytes.  Right?

And that happens regardless of use bytes?  

Assuming you are not on Windows (where writing the file would replace \n
with \r\n), that makes me think the text is in a multi-byte encoding
and that use bytes is not taking effect.

Clearly, there needs to be a way to get the length of a string in bytes.
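
To make the character/byte distinction concrete, here is a minimal sketch.
It assumes a Perl 5.8-style Unicode-aware perl with the core Encode module;
the variable names are only for illustration, and this is not the code from
the -S prog script under discussion.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# chr(400) is a code point above 255, so Perl has to store this
# string as UTF-8 internally: 3 ASCII bytes plus a 2-byte character.
my $txt = "abc" . chr(400);

# Without "use bytes", length counts characters.
my $chars = length $txt;

# Encoding explicitly to UTF-8 octets and measuring that is one way
# to get the size the file would have if written out as UTF-8.
my $octets = length encode_utf8($txt);

# "use bytes" is lexically scoped, so a bare block limits its effect.
my $internal;
{
    use bytes;
    $internal = length $txt;
}

print "characters:       $chars\n";
print "UTF-8 octets:     $octets\n";
print "under use bytes:  $internal\n";
```

Here the character count is 4 and both byte counts are 5.  If the three
numbers agree for the real data, the string holds no wide characters and the
size difference has to come from somewhere else (line endings, for example).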

Another test might be to write a perl script that reads from your -S prog
script instead of using swish -- that might make it easier to see what's
happening.  You could probably easily edit exprog.c to see what swish is
reading from the pipe, too.

There are also tools to dump scalars in Perl (my memory is not helping
today) that will show how the string is encoded.
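
One such tool, offered as a suggestion rather than as the one meant above,
is the core Devel::Peek module; utf8::is_utf8() reports the same internal
flag without the full dump.  A small sketch with made-up test strings:

```perl
use strict;
use warnings;
use Devel::Peek qw(Dump);

# All code points below 256: normally stored one byte per character.
my $latin = "caf" . chr(233);

# A code point above 255 forces UTF-8 internal storage.
my $wide = "caf" . chr(400);

# utf8::is_utf8() is always available; it returns the internal UTF8 flag.
my $latin_flag = utf8::is_utf8($latin) ? 1 : 0;
my $wide_flag  = utf8::is_utf8($wide)  ? 1 : 0;

print "latin UTF8 flag: $latin_flag\n";
print "wide  UTF8 flag: $wide_flag\n";

# Dump() writes to STDERR.  Look for UTF8 in the FLAGS line and at the
# raw bytes shown in the PV line.
Dump($wide);
```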

I had a problem lately where I was taking some HTML text, un-escaping the
HTML entities, and then splitting it on a set of characters
(WordCharacters from swish) to break it into words.  This was bombing out on
one document, which turned out to have an entity that was a Unicode char.
That seemed to trigger Perl to encode the string as UTF-8 (IIRC), but then
the split ended up cutting the string in the *middle* of a multi-byte
character, and Perl later reported an invalid UTF-8 string!
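
This is not a reconstruction of that split, just a sketch of the failure
mode: cutting a UTF-8 byte string at a fixed offset that happens to fall
inside a two-byte character.  The helper is_valid_utf8() is a hypothetical
name introduced here; decode() and FB_CROAK come from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;    # work on a copy; decode() may modify its input
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Six octets: 'a', 'b', 0xC6, 0x90, 'c', 'd'  (chr(400) is 0xC6 0x90).
my $octets = encode_utf8("ab" . chr(400) . "cd");

# Cut after the third octet, i.e. between 0xC6 and 0x90.
my $head = substr $octets, 0, 3;
my $tail = substr $octets, 3;

my $whole_ok = is_valid_utf8($octets);
my $head_ok  = is_valid_utf8($head);
my $tail_ok  = is_valid_utf8($tail);

print "whole: $whole_ok  head: $head_ok  tail: $tail_ok\n";
```

The whole string validates, but neither half does: the head ends with a lead
byte that has no continuation, and the tail starts with a stray continuation
byte.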

No, I have not played with use bytes.  I haven't had the need yet (perhaps
because of my platform).  It may be that I have not had any docs that end
up as Unicode or UTF-8 in Perl.

All I do in these cases is find the smallest test case I can work with and
then start dumping things byte-by-byte.  Once I understand what's
happening with the data it's easier to find a solution.  Or so I hope.

Try creating a doc right in the -S prog script, as in the use bytes man
page ($x = chr(400)), or maybe use entities and call unescapeHTML().
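
A tiny test case along those lines, dumped byte-by-byte, might look like the
sketch below.  The document content is made up, and how it would be handed
to swish from the -S prog script is left out.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Smallest useful test document: one wide character between two ASCII ones.
my $doc = "x" . chr(400) . "y";

# Turn the characters into the UTF-8 octets that would go down the pipe.
my $octets = encode_utf8($doc);

# Hex-dump every octet, two hex digits each, separated by spaces.
my $hex = join ' ', unpack '(H2)*', $octets;

print "characters: ", length($doc), "\n";
print "octets:     ", length($octets), "\n";
print "hex:        $hex\n";
```

For this string the dump is "78 c6 90 79": three characters, four octets.
Comparing a dump like this against what swish reads should show where the
counts start to differ.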

Bill Moseley
Received on Fri Feb 28 17:15:50 2003