Re: Fw: Re: 8-bit chars

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Dec 10 2003 - 23:10:39 GMT
On Wed, Dec 10, 2003 at 03:00:22PM -0800, John Angel wrote:
> Bill, I want to leave everything exactly as it was in the original.
> Nothing else. Is that possible?

Not with the libxml2 parser, because it does character conversions.
You can use the HTML parser, but it leaves a lot to be desired compared
with the libxml2 parser (it's helpful to try both parsers and compare
what gets indexed).

Even then, swish-e still converts characters via the tolower() function.
I'm not sure whether that would cause problems.  I assume not in this
case (i.e. tolower would leave those high-bit chars alone), but it would
be interesting to try a simple C program and see what tolower does.
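
A minimal sketch of such a test, not taken from the swish-e source: the
helper name count_high_bytes_changed is made up for illustration.  It
sets LC_CTYPE to the named locale and counts how many bytes in the range
0x80..0xFF tolower() changes, printing each one.

```c
#include <ctype.h>
#include <locale.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of swish-e): set LC_CTYPE to the given
 * locale and count the high-bit bytes (0x80..0xFF) that tolower()
 * changes, printing each mapping.  Returns -1 if the locale cannot be
 * set.  Note that the LC_CTYPE locale stays changed after the call.
 */
int count_high_bytes_changed(const char *locale_name)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    int changed = 0;
    for (int c = 0x80; c <= 0xFF; c++) {
        /* c is always in the unsigned char range, as tolower() requires */
        int lower = tolower(c);
        if (lower != c) {
            printf("0x%02x -> 0x%02x\n", c, lower);
            changed++;
        }
    }
    return changed;
}
```

Calling count_high_bytes_changed("C") should report no changes, since
the "C" locale only treats A-Z as upper case.  Passing "" picks up the
locale from the environment, and the result then depends on which
locales are installed; a Latin-1 locale, for example, may well lowercase
some of the high-bit bytes.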

If you do leave your encoding at windows-1250 then I assume you would
need to make sure your search queries are in windows-1250 as well.


> 
> 
> ----- Original Message ----- 
> From: "Bill Moseley" <moseley@hank.org>
> To: "John Angel" <angel_john@hotmail.com>
> Cc: "Multiple recipients of list" <swish-e@sunsite.berkeley.edu>
> Sent: Wednesday, December 10, 2003 23:40
> Subject: Re: [SWISH-E] Fw: Re: 8-bit chars
> 
> 
> > On Wed, Dec 10, 2003 at 01:42:17PM -0800, John Angel wrote:
> > > Here it is:
> >
> > Hi John,
> >
> > I'm not sure what you are asking.  If I index with the HTML parser the
> > chars are indexed.  If I index with the libxml2 parser they are not
> > indexed (of course I had to add the characters to *Characters settings).
> >
> > Note what happens if I use the iconv utility:
> >
> > moseley@bumby:~$ iconv -f WINDOWS-1250 -t LATIN1 test.htm
> > <HTML>
> > <META HTTP-EQUIV="Content-Type" CONTENT="text/html;
> > charset=Windows-1250">
> >
> > <P>Non-english chars: iconv: illegal input sequence at position 108
> >
> > 108 is 6c hex:
> >
> > 00000060  6c 69 73 68 20 63 68 61  72 73 3a 20 f0 2c 20 9e  |lish chars: ., .|
> >
> > The byte at that offset is f0.  That's a valid windows-1250 char (a small "d" with a line
> > through it).  If there's no 8859-1 character like that then it makes
> > sense it won't convert.
> >
> > I'm not sure what you want.  Do you want to convert to Windows-1250
> > character set instead of 8859-1 when parsing?  If so, you would need to
> > edit parser.c and use the iconv library to do your conversion.  I
> > suppose you would have to carefully edit your WordCharacter (and other)
> > settings so you are adding the right characters (based on your editor's
> > character set).  And as I mentioned, swish-e uses the tolower() function
> > and the LC_CTYPE locale is set to the default type.  So case conversion
> > may end up with odd results for some characters.
> >
> > I'm not sure why swish-e sets the LC_CTYPE locale.
> >
> > Interesting that when I read test.htm file with mozilla and a web server
> > it ignores the meta tag and says the file is 8859-1 but if I read it
> > without the web server it says it's Windows-1250.
> >
> >
> > -- 
> > Bill Moseley
> > moseley@hank.org
> >
> >
> 

-- 
Bill Moseley
moseley@hank.org
Received on Wed Dec 10 23:11:35 2003