Re: Once again filename encoding problems - Macos-X Tiger

From: Thomas Nyman <thomas(at)not-real.teg.pp.se>
Date: Thu Nov 17 2005 - 14:52:40 GMT
Thanks, good feedback. That gives me something to look at.


On 17 Nov 2005, at 14:24, Bill Moseley wrote:

> On Thu, Nov 17, 2005 at 10:30:12AM +0100, Thomas Nyman wrote:
>
>> Anyway, if I change the following setting in TemplateDefault  -  my
>> $output =  $q->header . page_header( $results ); - to my $output =
>> $q->header(-charset=>'UTF-8') . page_header( $results );  then
>> filenames are displayed correctly with regard to umlauts. However,
>> the content of swishdescription then displays incorrectly.
>
> Sorry, this kind of thing takes a lot of time.  And I have not worked
> enough with different encodings.  When I've looked at encodings in the
> past I spent a lot of time dumping Perl SVs, using wget, and using od
> to dump bytes of my source files.
>
> Again, swish uses libxml2 for parsing documents.  libxml2 can parse
> utf-8 (or most other encodings) and uses utf-8 internally but then
> swish takes that utf-8 and converts it to 8859-1 encoding.  So any
> characters that don't map to 8859-1 are replaced by a space and the
> parser should warn if that's happening (see ParserWarnLevel).
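
A rough sketch of the lossy step described above, written in Python only because it makes the bytes easy to see. This is not swish-e's actual code; the function name `to_latin1_lossy` is made up for illustration, and the only behaviour assumed is the one stated in the quoted text (unmappable characters become a space).

```python
def to_latin1_lossy(text):
    """Encode text as ISO-8859-1, replacing unmappable characters with a space.

    Returns the encoded bytes plus the list of characters that were lost,
    which is roughly what a parser warning would report.
    """
    out = bytearray()
    unmapped = []
    for ch in text:
        try:
            out += ch.encode("latin-1")
        except UnicodeEncodeError:
            # No 8859-1 code point for this character: substitute a space.
            out += b" "
            unmapped.append(ch)
    return bytes(out), unmapped


if __name__ == "__main__":
    # "Malmö €5": the o-umlaut exists in 8859-1, the euro sign does not.
    data, unmapped = to_latin1_lossy("Malm\u00f6 \u20ac5")
    print(data, unmapped)
```

So umlauts should survive this step as single bytes; anything outside 8859-1 is simply gone from the index.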
>
> Now, I would think that if you just took an 8859-1 encoded document
> and put it on the web with a content-type charset of utf-8 then only
> the first 128 chars (the ASCII range) would display correctly, since
> those map to the same bytes in utf-8.
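
To see that effect in isolation, here is a small Python sketch (Python is used purely for illustration; `show_as_utf8` is a hypothetical helper that mimics a browser told the page is utf-8):

```python
def show_as_utf8(raw):
    """Decode bytes the way a utf-8 consumer would, marking invalid bytes.

    Invalid sequences become U+FFFD, the replacement character that
    browsers usually draw as a question mark in a diamond.
    """
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    latin1_bytes = "Malm\u00f6".encode("latin-1")  # o-umlaut is the single byte 0xF6
    print(repr(show_as_utf8(latin1_bytes)))        # ASCII part is fine, the umlaut is not
    print(repr(show_as_utf8(b"Malmo")))            # pure ASCII is unchanged
```

The ASCII letters come through untouched, while the lone 0xF6 byte is not valid utf-8 and gets replaced.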
>
> You could take the output from swish and use iconv to convert 8859-1
> to utf-8 before sending to the browser, and then I would think you
> would be able to see all the chars.
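
The conversion suggested here is what `iconv -f ISO-8859-1 -t UTF-8` does. A minimal Python equivalent, as a sketch only (`latin1_to_utf8` is a name invented for this example, and it assumes the input really is 8859-1):

```python
def latin1_to_utf8(raw):
    """Re-encode ISO-8859-1 bytes as utf-8.

    Every byte value is a valid 8859-1 character, so the decode step
    cannot fail; it also cannot detect input that was never 8859-1.
    """
    return raw.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    before = b"Malm\xf6"             # o-umlaut as one byte
    after = latin1_to_utf8(before)   # o-umlaut as the two bytes 0xC3 0xB6
    print(before, after)
```

One caveat: running this on data that is already utf-8 would double-encode it, so it only helps where the bytes are known to be 8859-1.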
>
>
> Now, things are much more complex.  There's Perl.  For one thing, you
> have edited the Perl source files with an editor (on OS X?) and so
> when you saved that file it was encoded as utf-8 (I assume).  And if
> so, then you might need to tell Perl that your source files are utf-8.
> See perldoc utf8 and perldoc encoding.
>
>     use utf8;
>
> The output from swish also goes through perl, of course.  What happens
> to those characters?  I'm not quite sure with current Perl versions.
> For a while there was a utf8 flag on scalars (SVs) and sometimes Perl
> would have that flag set.  But, you might need to tell Perl that it's
> 8859-1.  (And it gets tricky, because SWISH::API uses Perl's xs (C
> interface) and dumps swish data right into Perl variables without
> consideration of encoding.)
>
> And then for those using templates like Template Toolkit, there are
> other issues with how the templates are encoded.
>
>> Since the bulk of the documents are word documents being parsed
>> through catdoc, I changed my swish.conf as follows:
>> FileFilter .doc /usr/local/bin/catdoc "-b -s8859-1 -dutf-8 '%p' "
>
> Are your Word docs really encoded in 8859-1?  (Or do they contain
> UTF-16 and then -s8859-1 is ignored?)
>
>> The results now show correct filenames with umlauts; however, there
>> are still some parts displaying incorrectly. The descriptions of the
>> file contents and the highlighting are pretty much correct, with one
>> or two faulty representations, but now parts of the form are
>> displaying incorrectly.  I'm enclosing a screendump of what it looks
>> like. Oh, and the browser's default encoding is utf-8.
>
> My guess there is that Perl is not reading your source files
> correctly.  If you are saving them as utf-8 you might need to "use
> utf8" at the top of the file.
>
>> The issue seems to point towards some part of the html page being
>> produced setting an encoding other than utf-8. The question is where
>> this is being set.
>
> That is the question. ;)  I, quite unfortunately, do things the slow
> way so I'd be sitting there with od and looking at the various bits of
> the output from the perl script directly.  I would then move to using
> wget to fetch the output via the web server to see if anything
> changes.
>
> The web browser is kind of a wild card, so it's best to make sure you
> know what the bytes say first.  If the bytes are really utf-8 then the
> browser needs to be told that.
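
For checking what the bytes say, something along these lines could stand in for od on output that has been saved to a file. It is a Python sketch with made-up helper names (`hex_bytes`, `is_valid_utf8`), not anything that ships with swish-e:

```python
def hex_bytes(raw):
    """Return the bytes as space-separated hex, similar to `od -An -tx1`."""
    return " ".join(f"{b:02x}" for b in raw)


def is_valid_utf8(raw):
    """Report whether the bytes decode cleanly as utf-8.

    Pure ASCII passes too, so True does not prove the data is utf-8;
    False does prove it is something else (for example 8859-1).
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for sample in (b"Malm\xc3\xb6", b"Malm\xf6"):
        print(hex_bytes(sample), is_valid_utf8(sample))
```

An umlaut showing up as `c3 b6` means utf-8; a lone `f6` means 8859-1. If both appear in the same page, two differently encoded sources are being mixed.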
>
> Or maybe just adding "use utf8" to the perl files you edit will be enough.
>
> -- 
> Bill Moseley
> moseley@hank.org
Received on Thu Nov 17 06:52:42 2005