It could be a problem with the 'locale' settings...
Is it possible to use a different function instead of UTF8Toisolat1()? Maybe
we could override it with our own function that uses a character conversion
table? And how do I use the old built-in HTML parser? configure --without-libxml?
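
To make the conversion-table idea concrete, here is a rough sketch in Python
(not C, and not swish-e code). It assumes the trouble is that byte 0x9A only
has a meaning in the Windows code pages, where it is U+0161, so a strict
UTF-8 to Latin-1 conversion has nowhere to put it. The names EXTRA_TABLE and
utf8_to_8bit are made up for illustration, and the table holds only two
entries as an example; a real replacement for UTF8Toisolat1() would have to
be written in C and would need a complete table for the code page in use.

```python
# Hypothetical table: Unicode code points that Latin-1 cannot represent,
# mapped to the byte the Windows-1252 code page uses for them.
EXTRA_TABLE = {
    0x0161: 0x9A,  # LATIN SMALL LETTER S WITH CARON
    0x2122: 0x99,  # TRADE MARK SIGN
}


def utf8_to_8bit(data, table=EXTRA_TABLE, fallback=ord("?")):
    """Convert UTF-8 bytes to single-byte text using a lookup table.

    Code points found in the table use the table's byte, other code
    points up to 0xFF pass through unchanged (as in Latin-1), and
    anything else becomes the fallback byte.
    """
    out = bytearray()
    for ch in data.decode("utf-8"):
        cp = ord(ch)
        if cp in table:
            out.append(table[cp])
        elif cp <= 0xFF:
            out.append(cp)
        else:
            out.append(fallback)
    return bytes(out)


if __name__ == "__main__":
    utf8_word = "abc\u0161def".encode("utf-8")
    # Print the converted word as hex so the 9a byte is visible.
    print(utf8_to_8bit(utf8_word).hex(" "))
```

With this table, the UTF-8 form of the word converts back to the bytes
61 62 63 9a 64 65 66, which is what the "word" file in the quoted test
below contains.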
> > Specific example is if you try to index word containing ASCII 154 char.
>
>That's a Windows extension to 8859-1, as far as I know. I would not be
>surprised to find that libxml2 didn't support it.
>
>But it seems like I'm indexing with that character without any problem:
>
>moseley@bumby:~$ swish-e -V
>SWISH-E 2.4.1
>
>Ok, I have a file "word" that contains that character:
>
>moseley@bumby:~$ hexdump -C word
>00000000 61 62 63 9a 64 65 66 0a |abc.def.|
> ^^
>And here's the config file:
>
>moseley@bumby:~$ cat c
>
>WordCharacters abcdef
>BeginCharacters a
>EndCharacters f
>
>here it is with hexdump:
>
>moseley@bumby:~$ hexdump -C c
>00000000 0a 57 6f 72 64 43 68 61 72 61 63 74 65 72 73 20 |.WordCharacters |
>00000010 61 62 63 9a 64 65 66 0a 42 65 67 69 6e 43 68 61 |abc.def.BeginCha|
>00000020 72 61 63 74 65 72 73 20 61 0a 45 6e 64 43 68 61 |racters a.EndCha|
>00000030 72 61 63 74 65 72 73 20 66 0a |racters f.|
>0000003a
>
>Now index. You can see that it is indeed indexed (9a is your character)
>
>moseley@bumby:~$ swish-e -i word -T indexed_words -v0 -c c | hexdump -C
>00000000 20 20 20 20 41 64 64 69 6e 67 3a 5b 31 3a 73 77 |    Adding:[1:sw|
>00000010 69 73 68 64 65 66 61 75 6c 74 28 31 29 5d 20 20 |ishdefault(1)]  |
>00000020 20 27 61 62 63 9a 64 65 66 27 20 20 20 50 6f 73 | 'abc.def'   Pos|
>00000030 3a 32 20 20 53 74 75 63 74 3a 30 78 39 20 28 20 |:2  Stuct:0x9 ( |
>00000040 42 4f 44 59 20 46 49 4c 45 20 29 0a |BODY FILE ).|
>0000004c
>
>Now try searching:
>
>moseley@bumby:~$ perl -le '$word = "abc".chr(154)."def"; print `swish-e -w $word -H0`'
>1000 word "word" 8
>
>So it found the word.
>
>This doesn't find it (different character):
>
>moseley@bumby:~$ perl -le '$word = "abc".chr(153)."def"; print `swish-e -w $word`'
># SWISH format: 2.4.1
># Search words: abc™def
># Removed stopwords:
>err: no results
>.
>
> > I assume the problem is conversion to Latin1 - as you said, it is not 100%
> > 8-bit clean. Is there some other function we could use to translate UTF-8 to
> > 8-bit chars, instead of UTF8Toisolat1()? Or even better - to completely
> > avoid conversion to UTF-8, but to leave every char as it is originally.
>
>libxml2 works with utf-8. Nothing I can do about that. For really
>8-bit clean you might try indexing with HTML (not HTML2 or HTML*) which
>will use the old built-in (and reasonably broken) HTML parser.
>
>
>--
>Bill Moseley
>moseley@hank.org
>
Received on Sat Dec 6 17:39:28 2003