Re: Fw: Re: 8-bit chars

From: John Angel <angel_john(at)not-real.hotmail.com>
Date: Wed Dec 10 2003 - 23:00:25 GMT
Bill, I want to leave everything exactly as it was in the original. Nothing
else. Is that possible?


----- Original Message ----- 
From: "Bill Moseley" <moseley@hank.org>
To: "John Angel" <angel_john@hotmail.com>
Cc: "Multiple recipients of list" <swish-e@sunsite.berkeley.edu>
Sent: Wednesday, December 10, 2003 23:40
Subject: Re: [SWISH-E] Fw: Re: 8-bit chars


> On Wed, Dec 10, 2003 at 01:42:17PM -0800, John Angel wrote:
> > Here it is:
>
> Hi John,
>
> I'm not sure what you are asking.  If I index with the HTML parser the
> chars are indexed.  If I index with the libxml2 parser they are not
> indexed (of course I had to add the characters to *Characters settings).
>
> Note what happens if I use the iconv utility:
>
> moseley@bumby:~$ iconv -f WINDOWS-1250 -t LATIN1 test.htm
> <HTML>
> <META HTTP-EQUIV="Content-Type" CONTENT="text/html;
> charset=Windows-1250">
>
> <P>Non-english chars: iconv: illegal input sequence at position 108
>
> 108 is 6c hex:
>
> 00000060  6c 69 73 68 20 63 68 61  72 73 3a 20 f0 2c 20 9e  |lish chars: ., .|
>
> Which is f0.  That's a valid windows-1250 char (a small "d" with a line
> through it).  If there's no 8859-1 character like that then it makes
> sense it won't convert.
>
> I'm not sure what you want.  Do you want to convert to Windows-1250
> character set instead of 8859-1 when parsing?  If so, you would need to
> edit parser.c and use the iconv library to do your conversion.  I
> suppose you would have to carefully edit your WordCharacter (and other)
> settings so you are adding the right characters (based on your editor's
> character set).  And as I mentioned, swish-e uses the tolower() function
> and the LC_CTYPE locale is set to the default type.  So case conversion
> may end up with odd results for some characters.
>
> I'm not sure why swish-e sets the LC_CTYPE locale.
>
> Interesting that when I read test.htm file with mozilla and a web server
> it ignores the meta tag and says the file is 8859-1 but if I read it
> without the web server it says it's Windows-1250.
>
>
> -- 
> Bill Moseley
> moseley@hank.org
>
>
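
The iconv failure quoted above can be reproduced without the original file.
What follows is only a rough sketch: it uses Python's built-in "cp1250" and
"latin-1" codecs as stand-ins for iconv's WINDOWS-1250 and LATIN1, and the
16 bytes are copied from the hexdump line at offset 0x60. The helper name
first_unconvertible is made up for this illustration and is not part of
swish-e or iconv.

```python
# Bytes at offset 0x60 of test.htm, copied from the hexdump quoted above.
chunk = bytes.fromhex("6c 69 73 68 20 63 68 61 72 73 3a 20 f0 2c 20 9e")


def first_unconvertible(data, base_offset=0):
    """Return (offset, char) for the first Windows-1250 character in `data`
    that has no ISO-8859-1 equivalent, or None if every character converts.

    Hypothetical helper for illustration only.
    """
    # Windows-1250 is a single-byte encoding, so the character index
    # equals the byte index within `data`.
    text = data.decode("cp1250")
    for index, char in enumerate(text):
        try:
            char.encode("latin-1")
        except UnicodeEncodeError:
            return base_offset + index, char
    return None


if __name__ == "__main__":
    # Reports offset 108 (0x6c), the position iconv complained about.
    print(first_unconvertible(chunk, base_offset=0x60))  # (108, 'đ')

    # UTF-8 can represent every Windows-1250 character, so this
    # conversion does not raise.
    print(chunk.decode("cp1250").encode("utf-8"))
```

Byte f0 decodes to U+0111 (small d with stroke), which has no code point in
8859-1, so the conversion stops there. Byte 9e (U+017E, small z with caron)
would fail for the same reason if the conversion got that far.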
Received on Wed Dec 10 23:00:30 2003