On Wed, Dec 10, 2003 at 01:42:17PM -0800, John Angel wrote:
> Here it is:
Hi John,
I'm not sure what you are asking. If I index with the HTML parser, the
characters are indexed. If I index with the libxml2 parser, they are not
(of course I had to add the characters to the *Characters settings).
Note what happens if I use the iconv utility:
moseley@bumby:~$ iconv -f WINDOWS-1250 -t LATIN1 test.htm
<HTML>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html;
charset=Windows-1250">
<P>Non-english chars: iconv: illegal input sequence at position 108
Position 108 is 6c in hex:
00000060 6c 69 73 68 20 63 68 61 72 73 3a 20 f0 2c 20 9e |lish chars: ð, .|
The byte at that offset is f0. That's a valid Windows-1250 character (a
small "d" with a line through it). If there's no character like that in
8859-1, then it makes sense that it won't convert.
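
To double-check the arithmetic and the failed conversion, here is a
rough sketch in Python. It uses Python's built-in "cp1250" and
"latin-1" codecs as a stand-in for iconv, so it only approximates what
the iconv utility does; the names ROW_OFFSET, row and fits_latin1 are
made up for this illustration.

```python
# Row copied from the hexdump above; it starts at file offset 0x60.
ROW_OFFSET = 0x60
row = bytes.fromhex("6c 69 73 68 20 63 68 61 72 73 3a 20 f0 2c 20 9e")

# iconv reported position 108 (decimal), which is 0x6c.
position = 108
bad_byte = row[position - ROW_OFFSET]

# Interpret that single byte as Windows-1250.
char = bytes([bad_byte]).decode("cp1250")


def fits_latin1(ch: str) -> bool:
    """Return True if ch has an ISO-8859-1 (Latin-1) encoding."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


print(hex(position), hex(bad_byte), repr(char), fits_latin1(char))
```

The byte decodes to U+0111 (small d with stroke), which has no slot in
8859-1, so a strict conversion has nothing to map it to.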
I'm not sure what you want. Do you want to convert to the Windows-1250
character set instead of 8859-1 when parsing? If so, you would need to
edit parser.c and use the iconv library to do your conversion. I
suppose you would also have to carefully edit your WordCharacters (and
other) settings so you are adding the right characters (based on your
editor's character set). And as I mentioned, swish-e uses the tolower()
function and the LC_CTYPE locale is set to the default type, so case
conversion may end up with odd results for some characters.
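
As a rough illustration of why the target character set and the case
conversion have to agree, here is a small Python sketch. It is only an
analogy: swish-e does this in C with iconv and tolower(), and the helper
lower_in_charset below is hypothetical, not anything in swish-e.

```python
def lower_in_charset(data: bytes, charset: str) -> bytes:
    """Lowercase single-byte text, interpreting the bytes in `charset`."""
    return data.decode(charset).lower().encode(charset)


# Capital S with caron (U+0160), as a parser working in Unicode would see it.
word = "\u0160"

# Converting to Windows-1250 keeps the character as one byte (0x8a).
as_cp1250 = word.encode("cp1250")

# Converting to 8859-1 cannot represent it; with errors="replace" it
# degrades to a question mark.
as_latin1 = word.encode("latin-1", errors="replace")

# Lowercasing with the matching charset gives the small letter (0x9a).
right = lower_in_charset(as_cp1250, "cp1250")

# Lowercasing the same byte as if it were 8859-1 leaves it unchanged,
# because 0x8a is not a letter in that charset.
wrong = lower_in_charset(as_cp1250, "latin-1")

print(as_cp1250, as_latin1, right, wrong)
```

So if the index held Windows-1250 bytes but case folding assumed a
different character set, the upper and lower case forms of some words
would not fold to the same bytes.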
I'm not sure why swish-e sets the LC_CTYPE locale.
Interesting that when I read the test.htm file with Mozilla through a
web server, it ignores the meta tag and says the file is 8859-1, but if
I read it without the web server it says it's Windows-1250.
--
Bill Moseley
moseley@hank.org
Received on Wed Dec 10 22:41:03 2003