Re: Problems with ISO 8859-1 to UTF-8 Conversion?

From: <jmruiz(at)not-real.boe.es>
Date: Fri Sep 13 2002 - 15:15:28 GMT
Hi Thomas,

Yep, you are right.

I also had this problem a few weeks ago.
I am not 100% sure, but the expat library seems to ignore the ISO-8859-1
encoding declaration in the XML header.
Use libxml2 instead (XML2) and the problem should go away.
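
To illustrate what an encoding mismatch of this kind does to a word, here is a small Python sketch. It does not use Swish-e, expat or libxml2; the helper name misread is made up for the example, and it only shows the general effect of writing text in one encoding and reading it in another.

```python
def misread(text, written_as, read_as):
    """Encode text one way and decode it another, as a mismatched pipeline would.

    Hypothetical helper for illustration; undecodable bytes become U+FFFD.
    """
    return text.encode(written_as).decode(read_as, errors="replace")


# Encoding honoured on both sides: the word survives.
print(misread("Théo", "iso-8859-1", "iso-8859-1"))

# UTF-8 bytes read as ISO-8859-1: the accented letter becomes two characters.
print(misread("Théo", "utf-8", "iso-8859-1"))

# ISO-8859-1 bytes read as UTF-8: the accented letter cannot be decoded.
print(misread("Théo", "iso-8859-1", "utf-8"))
```

In the two mismatched cases the indexer never sees the accented letter as a single ISO-8859-1 character, so a character translation rule for it would have nothing to match.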

By the way, for Spanish I use:
TranslateCharacters  aaeeiioouu

This will index "José" as "Jose"
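
As a rough sketch of what such a one-to-one character mapping does, here is the same idea in Python. This is not Swish-e's implementation. TARGET is the string from the config line above, while SOURCE is an assumed list of accented vowels chosen only for illustration, and index_form is a made-up name.

```python
# TARGET is taken from the config line above. SOURCE is an assumption made
# for this example only: a plausible list of accented vowels.
SOURCE = "áàéèíìóòúù"
TARGET = "aaeeiioouu"

# Each character in SOURCE maps to the character at the same position in TARGET.
TABLE = str.maketrans(SOURCE, TARGET)


def index_form(word):
    """Return the word as it would be stored after character translation."""
    return word.translate(TABLE)


print(index_form("José"))
```

Only the indexed form changes; the original text of the document stays as it was.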

cu
Jose

On 13 Sep 2002, at 3:20, Thomas Seifert wrote:

> Hi,
> 
> I have been playing around with Swish-E and French texts for the last
> few days and I've encountered a problem that I can't solve.
> 
> I'm indexing XML files (via the -S prog parameter) like this one:
> --------------- snip -----------------------------
> Path-Name: /tvtitel/287358
> Content-Length: 208
> Last-Mtime: 1031911762
> Document-Type: XML
> 
> <?xml version="1.0" encoding="ISO-8859-1"?>
> <xml>
> <titel>L'instit : Le choix de Théo</titel><desc>francetélévision
> (France2)|1 Boulevard Victor, Immeuble Le Barjac|F75015|Paris|11-09-02
> 21:10||||</desc></xml>
> --------------- snip -----------------------------
> 
> In the config file I use the "TranslateCharacters :ascii7:" parameter,
> which should index "Théo" as "Theo" (as I understand it, this feature
> converts only the index, not the actual text), so that I can
> search for "theo" and find the above document.
> 
> When printing the keywords (with -k '*') I can't find the word "theo":
> --------------- snip -----------------------------
> ... thac thaco the ti ticket ...
> --------------- snip -----------------------------
> 
> When I search for "theo" I get no results; when I search for
> "th*" I get this result:
> --------------- snip -----------------------------
> # SWISH format: 2.1-dev-26
> # Search words: titel=(th*)
> # Number of hits: 4
> # Search time: 0.001 seconds
> # Run time: 0.038 seconds
> L'instit : Le choix de Théo
> The Brian Benben Show
> Thaïlande
> Thé ou café
> --------------- snip -----------------------------
> 
> To me it looks like the conversion from UTF-8, which libxml uses
> internally, back to ISO-8859-1 for the indexer doesn't work.
> But no error is reported when indexing.
> 
> Any ideas?
> 
> thanks,
> thomas
> 
Received on Fri Sep 13 15:19:03 2002