Re: [swish-e] Encoding problems

From: Peter Karman <peter(at)>
Date: Wed Mar 17 2010 - 02:11:26 GMT

Patricio Mac Adden wrote on 3/16/10 11:58 AM:
> Hello, this is my first mail to this mailing list. I'm from La Plata,
> Argentina and my problem is this:
> I'm trying to index several document types: pdf, doc, xsl, txt, zip,
> etc. The documents may be encoded with UTF-8 or ISO-8859-1. I'm also
> using the directive TranslateCharacters _áéíóúñ -aeioun so the word "papá"
> is indexed as "papa" and so on.
> Suppose that in my indexed dir I have 3 documents, 2 containing the text
> "papá" and 1 containing the text "papa". So:
> $ swish-e -w papa
> must give me 3 hits instead of 1.

Here's a little test demonstrating that your approach is sound: it should work
if the encodings are what you say they are. Try reducing your document
collection to the smallest possible set that reproduces the problem you are
seeing. Often it's a configuration issue.
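
One way to check whether the encodings really are what you expect is to
classify each file's raw bytes before indexing. The following is only a rough
Python sketch, not part of swish-e; the function name guess_encoding is made up
for illustration, and the iso-8859-1 answer is a fallback guess, since every
byte sequence is technically valid latin1.

```python
import sys


def guess_encoding(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8', or 'iso-8859-1' (fallback guess)."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Any byte sequence decodes as ISO-8859-1, so this is an assumption,
        # not a positive identification.
        return "iso-8859-1"


if __name__ == "__main__":
    # Usage: python guess_encoding.py file1 file2 ...
    for name in sys.argv[1:]:
        with open(name, "rb") as handle:
            print(name, guess_encoding(handle.read()))
```

Files that come back as something other than what you expected are good
candidates for the minimal test set.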

I created 2 .html files, one in iso-8859-1 and one in utf-8. I indexed them and
could find both by searching for plain ascii 'papa'.

NOTE that because this email is in utf-8, the latin1 'á' in the papa2.html
output below will not display correctly, instead rendering as a '?'.

NOTE that swish-e converts utf-8 to latin1 internally, so even if your source is
in utf-8, it is not indexed that way.
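
To illustrate why both files end up matching plain 'papa', here is a small
Python sketch of the two steps described above: conversion to latin1, then
folding to 7-bit ascii. It is an approximation only. The helper names are
hypothetical, and the folding uses Unicode decomposition rather than the lookup
table swish-e itself uses for :ascii7:, so results may differ for characters
that have no decomposition (they are simply dropped here).

```python
import unicodedata


def to_latin1(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from source_encoding to ISO-8859-1.

    Characters that latin1 cannot represent become '?'.
    """
    return data.decode(source_encoding).encode("iso-8859-1", errors="replace")


def fold_to_ascii(latin1: bytes) -> str:
    """Strip accents from latin1 text, keeping only 7-bit characters."""
    text = latin1.decode("iso-8859-1")
    decomposed = unicodedata.normalize("NFKD", text)  # 'á' -> 'a' + combining accent
    return "".join(ch for ch in decomposed if ord(ch) < 128)


utf8_bytes = "pap\u00e1".encode("utf-8")         # b'pap\xc3\xa1' (5 bytes)
latin1_bytes = "pap\u00e1".encode("iso-8859-1")  # b'pap\xe1'     (4 bytes)

# After conversion both sources are byte-identical, so they index the same way.
same_after_conversion = to_latin1(utf8_bytes, "utf-8") == latin1_bytes
folded = fold_to_ascii(latin1_bytes)
```

The point is that the source bytes differ, but once both are in latin1 and
folded, the indexed word is identical.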

[karpet@pekmac:~/tmp/papa]$ swish-e -i *html -c papa.conf
Indexing Data Source: "File-System"
Indexing "papa1.html"
Indexing "papa2.html"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 9 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
9 unique words indexed.
4 properties sorted.
2 files indexed.  370 total bytes.  22 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
[karpet@pekmac:~/tmp/papa]$ swish-e -w papa
# SWISH format: 2.5.8
# Search words: papa
# Removed stopwords:
# Number of hits: 2
# Search time: 0.000 seconds
# Run time: 0.007 seconds
1000 papa2.html "papa iso-8859-1" 192
1000 papa1.html "papa utf-8" 178
[karpet@pekmac:~/tmp/papa]$ cat papa.conf
DefaultContents HTML*
TranslateCharacters :ascii7:
[karpet@pekmac:~/tmp/papa]$ cat papa1.html
  <title>papa utf-8</title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
  this is a utf-8 papa: papá
[karpet@pekmac:~/tmp/papa]$ cat papa2.html
  <title>papa iso-8859-1</title>
  <meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1" />
  this is a iso-8859-1 papa: pap?
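
If you want to reproduce the two test files without worrying about what your
editor or mail client does to the accented character, a script can write the
bytes explicitly. This is a sketch under assumptions: the function names are
invented here, and the whitespace may not match my files exactly, so the byte
counts swish-e reports can differ from the ones shown above.

```python
from pathlib import Path
from typing import List

TEMPLATE = (
    "<title>papa {enc}</title>\n"
    '<meta http-equiv="Content-Type" content="text/html;charset={enc}" />\n'
    "this is a {enc} papa: pap\u00e1\n"
)


def render_page(enc: str) -> bytes:
    """Return the test page encoded in the given character set."""
    return TEMPLATE.format(enc=enc).encode(enc)


def write_test_files(directory: str = ".") -> List[Path]:
    """Write papa1.html (utf-8) and papa2.html (iso-8859-1) into directory."""
    targets = {"papa1.html": "utf-8", "papa2.html": "iso-8859-1"}
    written = []
    for name, enc in targets.items():
        path = Path(directory) / name
        path.write_bytes(render_page(enc))  # bytes, so no implicit re-encoding
        written.append(path)
    return written
```

Calling write_test_files() in an empty directory and then running the swish-e
commands above should give you a comparable two-file collection.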

Peter Karman  .  .  peter(at)
Received on Tue Mar 16 22:11:28 2010