Re: [swish-e] ISO-8859-9 Encoding

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Wed Apr 22 2009 - 17:38:01 GMT
Fatih Aytaç wrote on 4/14/09 12:21 PM:
> I want to index iso-8859-9 encoded (Turkish) HTML files. I saved my html
> files with encoding iso-8859-9 and changed the charset metatag to
> charset=iso-8859-9.
> After I indexed my files with default settings and made a search for a
> keyword with Turkish chars (ex: danıştay) I got 0 results. I know a page
> which uses swish-e with Turkish chars. (
> http://www.kazanci.com/cgi-bin/ara.cgi     or with the turkish query
> http://www.kazanci.com/cgi-bin/ara.cgi?query=dan%c4%b1%c5%9ftay&sort=swishrank&si=0&si=3&si=8&start=10
> )
> I want to know how can I enable iso-8859-9 support for swish-e to index and
> search my iso-8859-9 encoded files.

Short answer: make sure you are not using the libxml2 (HTML2) parser. Explicitly
use the expat parser with this in your config file:

  DefaultContents HTML

Long answer:

I had hoped someone with first-hand knowledge of this issue would have responded
by now, but since they haven't, I'll take a stab at it.

Here are some basic facts so we both are on the same page:

1. If you are using the libxml2 parser (HTML2) then all your text is converted
to UTF-8 internally, then to Latin1 (iso-8859-1). I believe the locale you run
under and the declared encoding in the files are used only for the internal
conversion to UTF-8.

2. If you are using the expat parser (HTML) then all your text is indexed as-is.

3. WordCharacters and its related config options control which *bytes* are
considered valid token components. How those bytes are displayed to you depends
on which locale you are running under.

4. The swish-e query parser is single-byte only, and compares each byte in the
query against the same WordCharacters definitions used during indexing. So
WordCharacters and its cousins are important at both indexing and searching time.
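
Facts 3 and 4 can be illustrated with a small Python sketch. This is a
simplified model of byte-oriented tokenizing, not swish-e's actual code. The
function tokenize() and the byte sets ASCII_WORD and TURKISH_WORD are
hypothetical names made up for the illustration.

```python
# Simplified model of byte-oriented tokenizing (NOT swish-e's real code).
import string

# Hypothetical word-character sets, expressed as sets of byte values.
ASCII_WORD = set((string.ascii_lowercase + string.digits).encode("ascii"))
# 0xFD and 0xFE are the ISO-8859-9 bytes for dotless i and s-cedilla.
TURKISH_WORD = ASCII_WORD | {0xFD, 0xFE}

def tokenize(data: bytes, word_bytes: set) -> list:
    """Split data into runs of bytes that are all members of word_bytes."""
    tokens, current = [], bytearray()
    for b in data:
        if b in word_bytes:
            current.append(b)
        elif current:
            tokens.append(bytes(current))
            current = bytearray()
    if current:
        tokens.append(bytes(current))
    return tokens

# "danıştay", written with escapes so the source file encoding does not matter.
raw = "dan\u0131\u015ftay".encode("iso-8859-9")

print(tokenize(raw, ASCII_WORD))    # the word is split at the two high bytes
print(tokenize(raw, TURKISH_WORD))  # the word survives as a single token
```

The same byte set has to be applied to the query string, otherwise the query
is split differently from the indexed text and nothing matches.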


Based on those facts above, I did an experiment.

First, I created a test .html file with the word 'danıştay' in it. The file was
encoded as UTF-8. I have Swish-e 2.4.7 compiled with libxml2, so it uses the
HTML2 parser by default. I indexed it using all the default config options
(i.e., no config file). Then I searched for the word 'danıştay' and got one hit.

Cool, I thought. Just encode as UTF-8 instead of 8859-9 and it Does the Right Thing.

However, when I ran the indexing again with debugging options turned on, I saw
that the word 'danıştay' was being tokenized as 3 distinct tokens. You can see
why in the debugging output below.

Here are the Unicode code points of the word, first in hex:

0064  0061  006e  0131  015f  0074  0061  0079 |  d a n ı ş t a y

or in decimal:

00100 00097 00110 00305 00351 00116 00097 00121 |  d a n ı ş t a y


What happens is that the codepoint x0131 is being converted to Latin1 using the
libxml2 (iconv if you have it compiled that way) routine and it ends up as
x00e4. Or in glyph terms:

  ı  =>  ä

Likewise, x015f is converted to x00e5. Or:

  ş  =>  å


In other words, messy and wrong. It should convert to the ISO-8859-9 values:

 dec       hex
--------------------
 00253  ı  000fd
 00254  ş  000fe
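
The code points and target bytes above can be checked with Python's standard
codecs. This is only a sketch for verifying the numbers; it says nothing about
how libxml2 or swish-e do the conversion internally.

```python
# Inspect the code points and encoded bytes of the test word.
word = "dan\u0131\u015ftay"  # danıştay

for ch in word:
    print(f"U+{ord(ch):04X} {ord(ch):5d} {ch}")

utf8 = word.encode("utf-8")
latin5 = word.encode("iso-8859-9")

print(utf8.hex(" "))             # dotless i and s-cedilla take two bytes each
print(latin5.hex(" "))           # one byte each: fd and fe
print(len(utf8) - len(latin5))   # size difference between the two encodings

# Neither character exists in Latin1 (iso-8859-1), so a strict conversion
# to Latin1 cannot succeed for this word.
try:
    word.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)
```

The two-byte size difference agrees with the swishdocsize values of 181 and 179
reported for the two test files further down.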


The reason the search part finds it is that the query parser seems to do the
same tokenization. So they are consistently wrong. Which is something, anyway.
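
One possible reading of the three tokens, which is an assumption and not
something confirmed by the swish-e or libxml2 sources: the UTF-8 lead bytes of
ı and ş are 0xC4 and 0xC5, which are Ä and Å in Latin1 and lowercase to ä
(0xE4) and å (0xE5). The Python sketch below treats the UTF-8 bytes as Latin1,
lowercases them, and splits on anything that is not a letter or digit. The
word-character class is a guess chosen for the illustration.

```python
import re

word = "dan\u0131\u015ftay"  # danıştay

# Assumption: each UTF-8 byte ends up being treated as one Latin1 character.
as_latin1 = word.encode("utf-8").decode("latin-1").lower()

# Assumption: word characters are ASCII letters and digits plus the Latin1
# letters in 0xC0-0xFF (excluding the multiplication and division signs).
WORD_RUN = re.compile("[a-z0-9\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u00ff]+")

tokens = WORD_RUN.findall(as_latin1)
print(tokens)
```

Under those assumptions the word splits into the same three tokens that appear
in the debugging output below, and a query passed through the same steps would
split the same way.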

[karpet@pekmac:~/tmp]$ swish-e -T parsed_text -T properties -T indexed_words \
-i turkish-utf8.html  -v 9
Parsing config file 'turkish.conf'
Indexing Data Source: "File-System"
Indexing "turkish-utf8.html"

Checking file "turkish-utf8.html"...
  turkish-utf8.html - Using DEFAULT (HTML2) parser - this page body is 8859-9
    Adding:[1:swishdefault(1)]   'this'   Pos:5  Stuct:0x7 ( HEAD TITLE FILE )
    Adding:[1:swishdefault(1)]   'page'   Pos:6  Stuct:0x7 ( HEAD TITLE FILE )
    Adding:[1:swishdefault(1)]   'body'   Pos:7  Stuct:0x7 ( HEAD TITLE FILE )
    Adding:[1:swishdefault(1)]   'is'   Pos:8  Stuct:0x7 ( HEAD TITLE FILE )
    Adding:[1:swishdefault(1)]   '8859'   Pos:9  Stuct:0x7 ( HEAD TITLE FILE )
    Adding:[1:swishdefault(1)]   '9'   Pos:10  Stuct:0x7 ( HEAD TITLE FILE )
dantays
    Adding:[1:swishdefault(1)]   'danä'   Pos:16  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'å'   Pos:17  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'tay'   Pos:18  Stuct:0x9 ( BODY FILE )
 (9 words)
          swishdocpath: 6 ( 17) S: "turkish-utf8.html"
            swishtitle: 7 ( 24) S: "this page body is 8859-9"
          swishdocsize: 8 (  4) N: "181"
     swishlastmodified: 9 (  4) D: "2009-04-22 10:21:06 CDT"

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 9 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
9 unique words indexed.
4 properties sorted.
1 file indexed.  181 total bytes.  9 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
[karpet@pekmac:~/tmp]$ swish-e -w danıştay
# SWISH format: 2.4.7
# Search words: danıştay
# Removed stopwords:
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.006 seconds
1000 turkish-utf8.html "this page body is 8859-9" 181
.
[karpet@pekmac:~/tmp]$ cat turkish-utf8.html
<html>
 <head>
  <title>this page body is 8859-9</title>
<!-- the Content-Type meta seems to ignored by libxml2 -->
  <meta http-equiv="Content-Type" content="text/html;charset=iso-8859-9" />
 </head>
 <body>
   danıştay
 </body>
</html>


I did the same experiment with turkish-88599.html, encoded as iso-8859-9 instead
of UTF-8. That failed to find the word because at index time it just ignored the
characters that were not in the iso-8859-9 charset. Then I tried again with the
same file, same locale, but explicitly told swish-e not to use the libxml2
parser. Here's the output:

[karpet@pekmac:~/tmp]$ swish-e -T parsed_text -T properties -T indexed_words \
-i turkish-88599.html  -v 9 -c turkish.conf
Parsing config file 'turkish.conf'
Indexing Data Source: "File-System"
Indexing "turkish-88599.html"

Checking file "turkish-88599.html"...
  turkish-88599.html - Using HTML parser -     Adding:[1:swishdefault(1)]   'this'   Pos:1  Stuct:0x7 ( HEAD TITLE FILE )
    Adding:[1:swishdefault(1)]   'page'   Pos:2  Stuct:0x7 ( HEAD TITLE FILE )
    Adding:[1:swishdefault(1)]   'body'   Pos:3  Stuct:0x7 ( HEAD TITLE FILE )
    Adding:[1:swishdefault(1)]   'is'   Pos:4  Stuct:0x7 ( HEAD TITLE FILE )
    Adding:[1:swishdefault(1)]   '8859'   Pos:5  Stuct:0x7 ( HEAD TITLE FILE )
    Adding:[1:swishdefault(1)]   '9'   Pos:6  Stuct:0x7 ( HEAD TITLE FILE )
    Adding:[1:swishdefault(1)]   'danıştay'   Pos:7  Stuct:0x9 ( BODY FILE )
 (7 words)
          swishdocpath: 6 ( 18) S: "turkish-88599.html"
            swishtitle: 7 ( 24) S: "this page body is 8859-9"
          swishdocsize: 8 (  4) N: "179"
     swishlastmodified: 9 (  4) D: "2009-04-22 10:24:26 CDT"

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 7 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
7 unique words indexed.
4 properties sorted.
1 file indexed.  179 total bytes.  7 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
[karpet@pekmac:~/tmp]$ cat turkish-88599.html
<html>
 <head>
  <title>this page body is 8859-9</title>
  <meta http-equiv="Content-Type" content="text/html;charset=iso-8859-9" />
 </head>
 <body>
   danıştay
 </body>
</html>

[karpet@pekmac:~/tmp]$ swish-e -w danıştay
# SWISH format: 2.4.7
# Search words: danıştay
# Removed stopwords:
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.006 seconds
1000 turkish-88599.html "this page body is 8859-9" 179
.

[karpet@pekmac:~/tmp]$ cat turkish.conf
DefaultContents HTML

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Wed Apr 22 13:38:02 2009