Indexing umlauts

From: Thomas Nyman <thomas(at)not-real.teg.pp.se>
Date: Mon Dec 12 2005 - 20:09:16 GMT
Hi

I read the thread on indexing German umlauts and I have a similar
problem.

I made a Word document for testing.
The document contains the following two words:

överskottslager

boy

When I run swish-e -c swish_se.conf -i test.doc -T indexed_words -v0

I get the following:

     Adding:[1:swishdocpath(11)]   'test'   Pos:1  Stuct:0x1 ( FILE )
     Adding:[1:swishdocpath(11)]   'doc'   Pos:2  Stuct:0x1 ( FILE )
     Adding:[1:swishdefault(1)]   'a'   Pos:1  Stuct:0x1 ( FILE )
     Adding:[1:swishdefault(1)]   'verskottslager'   Pos:2  Stuct:0x1 ( FILE )
     Adding:[1:swishdefault(1)]   'boy'   Pos:3  Stuct:0x1 ( FILE )

It would appear that my umlaut is being treated as a separate word and
split from the word it actually belongs to.
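
One possible explanation, sketched in Python purely as an illustration: if the Word filter emits UTF-8 but the bytes are read as Latin-1, the single letter "ö" becomes the two characters "Ã" and "¶". An accent-stripping step would turn "Ã" into "a", and "¶" is not a word character, so the word breaks right after it. This is only a guess at the cause. The helper names below (to_ascii7, split_words) are hypothetical and are not swish-e functions, the accent stripping is a rough stand-in for TranslateCharacters :ascii7:, and the test word is assumed to start with "ö".

```python
import re
import unicodedata

# WordCharacters as given in the config file in this post.
WORD_CHARS = "&0123456789_abcdefghijklmnopqrstuvwxyz"


def to_ascii7(text):
    """Rough stand-in for ':ascii7:' translation: strip accents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def split_words(text):
    """Lowercase the text and keep only runs of WORD_CHARS as words."""
    pattern = "[" + re.escape(WORD_CHARS) + "]+"
    return re.findall(pattern, to_ascii7(text).lower())


original = "överskottslager"  # assumed spelling of the test word
# Assumption: UTF-8 bytes from the filter are misread as Latin-1.
misread = original.encode("utf-8").decode("latin-1")

print(split_words(misread))   # ['a', 'verskottslager']
print(split_words(original))  # ['overskottslager']
```

The first result matches the 'a' / 'verskottslager' pair in the output above, which is why a character-encoding mismatch seems worth ruling out before changing WordCharacters.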

I have the following in my config file:

TranslateCharacters :ascii7:
WordCharacters &0123456789_abcdefghijklmnopqrstuvwxyz
BeginCharacters &0123456789_abcdefghijklmnopqrstuvwxyz
EndCharacters +0123456789_abcdefghijklmnopqrstuvwxyz

Does anyone have any ideas as to what is doing this?

I could use the answer from the previous thread and do something
like this:

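for ($query) {  # trim the query string
         s/Ö/O/g;    # fold the umlaut to ASCII (pattern character assumed to be Ö)
         s/\s+$//;   # strip trailing whitespace
         s/^\s+//;   # strip leading whitespace
}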

but since the letter is being split off from the word, that doesn't really help me.

Thanks
Received on Mon Dec 12 12:09:22 2005