Re: 8-bit chars

From: John Angel <angel_john(at)not-real.hotmail.com>
Date: Sat Dec 06 2003 - 17:39:23 GMT
It could be some problem with 'locale' settings...

Is it possible to use a different function instead of UTF8Toisolat1()? Maybe
we could override it with our own function that uses a character conversion table?
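
A minimal sketch of what such a table-driven replacement could look like, written in Python rather than C for brevity. It is not swish-e or libxml2 code: the function name utf8_to_8bit, the choice of cp1252 as the target table, and the "?" fallback are all assumptions made for illustration.

```python
def utf8_to_8bit(data: bytes, codepage: str = "cp1252", fallback: bytes = b"?") -> bytes:
    """Convert UTF-8 input to a single-byte encoding, one character at a time.

    Hypothetical helper: characters that the target code page cannot
    represent are replaced with `fallback` instead of aborting the conversion.
    """
    text = data.decode("utf-8")
    out = bytearray()
    for ch in text:
        try:
            # The codec's own mapping table does the lookup.
            out += ch.encode(codepage)
        except UnicodeEncodeError:
            out += fallback
    return bytes(out)


if __name__ == "__main__":
    # U+0161 is the character that cp1252 stores as byte 0x9a (154).
    sample = "abc\u0161def".encode("utf-8")
    print(utf8_to_8bit(sample))             # cp1252 target
    print(utf8_to_8bit(sample, "latin-1"))  # strict Latin-1 target, falls back
```

With a cp1252 table the word keeps its 0x9a byte; with a strict Latin-1 table the same character has no mapping and ends up as the fallback.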

How do I use the old built-in HTML parser? With configure --without-libxml?



> > A specific example is if you try to index a word containing character 154 (0x9a).
>
>That's a Windows extension to 8859-1, as far as I know.  I would not be
>surprised to find that libxml2 didn't support it.
>
>But it seems like I'm indexing with that character without any problem:
>
>moseley@bumby:~$ swish-e -V
>SWISH-E 2.4.1
>
>Ok, I have a file "word" that contains that character:
>
>moseley@bumby:~$ hexdump -C word
>00000000  61 62 63 9a 64 65 66 0a                           |abc.def.|
>                    ^^
>And here's the config file:
>
>moseley@bumby:~$ cat c
>
>WordCharacters abcdef
>BeginCharacters a
>EndCharacters f
>
>here it is with hexdump:
>
>moseley@bumby:~$ hexdump -C c
>00000000  0a 57 6f 72 64 43 68 61  72 61 63 74 65 72 73 20  |.WordCharacters |
>00000010  61 62 63 9a 64 65 66 0a  42 65 67 69 6e 43 68 61  |abc.def.BeginCha|
>00000020  72 61 63 74 65 72 73 20  61 0a 45 6e 64 43 68 61  |racters a.EndCha|
>00000030  72 61 63 74 65 72 73 20  66 0a                    |racters f.|
>0000003a
>
>Now index.  You can see that it is indeed indexed (9a is your character)
>
>moseley@bumby:~$ swish-e -i word -T indexed_words -v0 -c c | hexdump -C
>00000000  20 20 20 20 41 64 64 69  6e 67 3a 5b 31 3a 73 77  |    Adding:[1:sw|
>00000010  69 73 68 64 65 66 61 75  6c 74 28 31 29 5d 20 20  |ishdefault(1)]  |
>00000020  20 27 61 62 63 9a 64 65  66 27 20 20 20 50 6f 73  | 'abc.def'   Pos|
>00000030  3a 32 20 20 53 74 75 63  74 3a 30 78 39 20 28 20  |:2  Stuct:0x9 ( |
>00000040  42 4f 44 59 20 46 49 4c  45 20 29 0a              |BODY FILE ).|
>0000004c
>
>Now try searching:
>
>moseley@bumby:~$ perl -le '$word = "abc".chr(154)."def"; print `swish-e -w $word -H0`'
>1000 word "word" 8
>
>So it found the word.
>
>This doesn't find it (different character):
>
>moseley@bumby:~$ perl -le '$word = "abc".chr(153)."def"; print `swish-e -w $word`'
># SWISH format: 2.4.1
># Search words: abc™def
># Removed stopwords:
>err: no results
>.
>
> > I assume the problem is conversion to Latin1 - as you said, it is not 100%
> > 8-bit clean. Is there some other function we could use to translate UTF-8 to
> > 8-bit chars, instead of UTF8Toisolat1()? Or even better - to completely
> > avoid conversion to UTF-8, but to leave every char as it is originally.
>
>libxml2 works with utf-8.  Nothing I can do about that.  For really
>8-bit clean indexing you might try HTML (not HTML2 or HTML*), which
>will use the old built-in (and reasonably broken) HTML parser.
>
>
>--
>Bill Moseley
>moseley@hank.org
>
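
The "Windows extension to 8859-1" point in the quoted message can be checked directly. The sketch below is an illustration in Python, not part of swish-e; the helper name describe_byte is hypothetical. It shows how one byte value reads under ISO-8859-1 and under Windows-1252. Whether libxml2 sees the input as one or the other depends on the encoding declared for the document, which this example does not model.

```python
import unicodedata


def describe_byte(value: int) -> dict:
    """Report how a single byte is interpreted as Latin-1 and as cp1252."""
    raw = bytes([value])
    latin1 = raw.decode("latin-1")  # Latin-1 maps every byte to U+0000..U+00FF
    try:
        cp1252 = raw.decode("cp1252")
    except UnicodeDecodeError:
        cp1252 = None  # a few bytes are undefined in Python's cp1252 table
    return {
        "latin1_codepoint": ord(latin1),
        "latin1_category": unicodedata.category(latin1),
        "cp1252_codepoint": ord(cp1252) if cp1252 is not None else None,
    }


if __name__ == "__main__":
    for value in (0x9A, 0x99):  # the two bytes used in the tests above
        print(hex(value), describe_byte(value))
```

Under Latin-1, byte 0x9a is a C1 control code (category "Cc"), while under cp1252 it is U+0161, which lies outside the 0x00-0xff range that a Latin-1 conversion can express.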

Received on Sat Dec 6 17:39:28 2003