Re: Indexing pdf files

From: Bill Moseley <moseley(at)>
Date: Thu Jan 30 2003 - 02:25:38 GMT
On Wed, 29 Jan 2003, David Cogley wrote:

> On Wed, 29 Jan 2003, Bill Moseley wrote:
> > What's your LANG environment variable set to?
> en_US.UTF-8


Try:

  export LANG=en_US


or, better, remove that tr/// line.

I've been meaning to ask on a Perl list why that error occurs.  There's
some discussion of it on PerlMonks.

I wish I understood what's happening.  My guess has been that the text is
in UTF-8 and the tr/// is operating on bytes instead of chars, and that
breaks the char encoding.  That's hard to believe, though.  I just have not
spent the time to grok all the encoding stuff in Perl yet.
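
To make that guess concrete, here is a minimal sketch in Python rather
than Perl.  It is only an illustration of the byte-versus-character
hazard, not a reproduction of the actual tr/// line, which isn't shown in
this thread.  The filter (deleting the 0x80-0x9F control range) and the
function names strip_c1_bytes and strip_c1_chars are assumptions made up
for the example.

```python
# Hypothetical filter: delete the C1 control range 0x80-0x9F.
# The real tr/// line is not shown in this thread; this range is an assumption.
C1_RANGE = range(0x80, 0xA0)


def strip_c1_bytes(data: bytes) -> bytes:
    """Delete 0x80-0x9F byte by byte, ignoring any multibyte encoding."""
    return data.translate(None, bytes(C1_RANGE))


def strip_c1_chars(text: str) -> str:
    """Delete U+0080-U+009F character by character."""
    return text.translate({cp: None for cp in C1_RANGE})


text = "\u00c9tat"               # "Etat" with an acute accent on the E
encoded = text.encode("utf-8")   # U+00C9 becomes the two bytes C3 89

# Byte-level filtering removes 0x89, the second half of the encoded letter,
# leaving a lone 0xC3 lead byte followed by ASCII "t".
damaged = strip_c1_bytes(encoded)

try:
    damaged.decode("utf-8")
    byte_level_ok = True
except UnicodeDecodeError:
    byte_level_ok = False

# Character-level filtering sees U+00C9 as one character outside the range
# and leaves the string alone.
char_level = strip_c1_chars(text)
```

If something similar is happening in the filter, the output would no
longer be valid UTF-8, which is the kind of input an XML parser would
reject.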

I'm not clear on what that LANG setting is doing, or what's happening with
pdftotext.  I assume that the pdftotext program is outputting UTF-8, and
that perl also assumes the text is in UTF-8 because of the LANG setting.
I'm not sure if libxml2 would detect that it's UTF-8.
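
One piece that can be pinned down is what the LANG value itself carries.
A POSIX locale name has the form language[_territory][.codeset][@modifier],
so en_US.UTF-8 names a codeset and plain en_US does not.  The sketch
below (again Python, with a hypothetical helper name locale_codeset) only
splits the string; how perl or pdftotext act on the result is not
something it shows.

```python
def locale_codeset(lang_value: str):
    """Return the codeset part of a locale name, or None if there is none."""
    base = lang_value.split("@", 1)[0]   # drop an optional @modifier
    if "." not in base:
        return None
    return base.split(".", 1)[1]


with_codeset = locale_codeset("en_US.UTF-8")   # the value from the quoted reply
without_codeset = locale_codeset("en_US")      # the value suggested above
```

With LANG=en_US there is no UTF-8 hint left in the environment, which
would be consistent with the workaround changing how the text is treated.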

Can anyone explain all this encoding stuff in a few paragraphs? ;)

Bill Moseley
Received on Thu Jan 30 02:26:01 2003