Bill, I want to leave everything exactly as it was in the original. Nothing
else. Is that possible?
----- Original Message -----
From: "Bill Moseley" <email@example.com>
To: "John Angel" <firstname.lastname@example.org>
Cc: "Multiple recipients of list" <email@example.com>
Sent: Wednesday, December 10, 2003 23:40
Subject: Re: [SWISH-E] Fw: Re: 8-bit chars
> On Wed, Dec 10, 2003 at 01:42:17PM -0800, John Angel wrote:
> > Here it is:
> Hi John,
> I'm not sure what you are asking. If I index with the HTML parser the
> chars are indexed. If I index with the libxml2 parser they are not
> indexed (of course I had to add the characters to *Characters settings).
> Note what happens if I use the iconv utility:
> moseley@bumby:~$ iconv -f WINDOWS-1250 -t LATIN1 test.htm
> <META HTTP-EQUIV="Content-Type" CONTENT="text/html;
> <P>Non-english chars: iconv: illegal input sequence at position 108
> 108 is 6c hex:
> 00000060 6c 69 73 68 20 63 68 61 72 73 3a 20 f0 2c 20 9e |lish chars:
> Which is f0. That's a valid windows-1250 char (a small "d" with a line
> through it). If there's no 8859-1 character like that then it makes
> sense it won't convert.
> I'm not sure what you want. Do you want to convert to Windows-1250
> character set instead of 8859-1 when parsing? If so, you would need to
> edit parser.c and use the iconv library to do your conversion. I
> suppose you would have to carefully edit your WordCharacter (and other)
> settings so you are adding the right characters (based on your editor's
> character set). And as I mentioned, swish-e uses tolower() function
> and the LC_CTYPE locale is set to the default type. So case conversion
> may end up with odd results for some characters.
> I'm not sure why swish-e sets the LC_CTYPE locale.
> Interesting that when I read the test.htm file with Mozilla and a web server
> it ignores the meta tag and says the file is 8859-1 but if I read it
> without the web server it says it's Windows-1250.
> Bill Moseley
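
The iconv failure quoted above can be reproduced outside swish-e. The following is only a rough sketch in Python, not anything swish-e itself does: it takes the sixteen bytes shown in the hexdump row at offset 0x60, decodes each one as Windows-1250 and tries to re-encode it as ISO-8859-1. The helper name first_unconvertible and its parameters are made up for this illustration.

```python
import unicodedata

# The hexdump row quoted above starts at file offset 0x60.
ROW_OFFSET = 0x60
ROW = bytes.fromhex("6c 69 73 68 20 63 68 61 72 73 3a 20 f0 2c 20 9e")


def first_unconvertible(data, source="cp1250", target="latin-1", base=0):
    """Return (offset, byte, char) for the first byte of `data` whose
    character in `source` has no representation in `target`, else None.

    Hypothetical helper for illustration; it assumes a single-byte
    source encoding so each byte can be decoded on its own.
    """
    for i, b in enumerate(data):
        ch = bytes([b]).decode(source)
        try:
            ch.encode(target)
        except UnicodeEncodeError:
            return base + i, b, ch
    return None


result = first_unconvertible(ROW, base=ROW_OFFSET)
if result is not None:
    offset, byte, char = result
    # Print the Unicode name rather than the character itself so the
    # output does not depend on the terminal's encoding.
    print(f"offset {offset} (0x{offset:x}): byte 0x{byte:02x} "
          f"is {unicodedata.name(char)}")
```

The first byte it reports is 0xf0 at offset 108 (0x6c), which Windows-1250 maps to LATIN SMALL LETTER D WITH STROKE. ISO-8859-1 has no such character, so a strict conversion has nothing to map it to, which matches the position iconv complained about.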
Received on Wed Dec 10 23:00:30 2003