Iso 8859-2 chars and HTML2 parser problem

From: Krzysztof Rudnik <K.Rudnik(at)not-real.rzeczpospolita.pl>
Date: Fri Sep 13 2002 - 13:53:36 GMT
Hi all,

I've just found the following unpredictable/strange/buggy behavior of the HTML2
parser when parsing iso-8859-2 documents.

Indexed document:
<html>
  <head>
  <meta http-equiv="content-type" content="text/html; charset=iso-8859-2">
    <title></title>
  </head>
  <body>
bałałajka  
</body>
</html>
swish-e -T INDEXED_WORDS gives different results in similar situations.
In each of them the iso-8859-2 characters are ignored
(WordChars is correct in my conf file; all iso-8859-2 chars are included).
------------------------------------------------------------
1. when body is of the form : 
<body>
bałałajka 
</body>
swish-e gives 
  Adding:[1:swishdefault(1)]   'ba'   Pos:4  Stuct:0x9 ( BODY FILE )

2. when body is of the form : 
 <body>
any word bałałajka  
</body>
swish-e gives 
    Adding:[1:swishdefault(1)]   'any'   Pos:4  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'word'   Pos:5  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'ba'   Pos:6  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'a'   Pos:7  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'ajka'   Pos:8  Stuct:0x9 ( BODY FILE )
3. when body is of the form:
<body>
bałałajka
word 
</body>
swish-e gives
    Adding:[1:swishdefault(1)]   'ba'   Pos:4  Stuct:0x9 ( BODY FILE )

4. when body is of the form:
<body>
bałałajka
any word 
</body>
swish-e gives
    Adding:[1:swishdefault(1)]   'ba'   Pos:4  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'a'   Pos:5  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'ajka'   Pos:6  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'any'   Pos:7  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'word'   Pos:8  Stuct:0x9 ( BODY FILE )

etc. 

The strange thing is that everything works perfectly when I replace
<meta http-equiv="content-type" content="text/html; charset=iso-8859-2">
with
<meta http-equiv="content-type" content="text/html">
in the head of the document. swish-e then gives:
 Adding:[1:swishdefault(1)]   'bałałajka'   Pos:4  Stuct:0x9 ( BODY FILE )
 Adding:[1:swishdefault(1)]   'any'   Pos:5  Stuct:0x9 ( BODY FILE )
 Adding:[1:swishdefault(1)]   'word'   Pos:6  Stuct:0x9 ( BODY FILE )
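
A possible explanation, stated here as an assumption rather than a confirmed
diagnosis: if the parser honours the declared charset and decodes the bytes as
iso-8859-2, the letter ł becomes U+0142, which has no equivalent in an 8-bit
Latin-1 character table. If such characters are then treated as word
separators, the word splits exactly as in cases 2 and 4. The Python sketch
below only models that hypothesis; the function name split_on_non_latin1 is
made up for illustration and is not part of swish-e.

```python
def split_on_non_latin1(text: str) -> list[str]:
    """Hypothetical model: any character outside Latin-1 (code point > 255)
    is treated as a word separator, then the text is split on whitespace."""
    cleaned = "".join(ch if ord(ch) < 256 else " " for ch in text)
    return cleaned.split()


# The bytes as they are stored in the iso-8859-2 document (ł is byte 0xB3).
raw = "bałałajka any word".encode("iso-8859-2")

# Charset declared: bytes are decoded as iso-8859-2, so ł becomes U+0142.
with_charset = split_on_non_latin1(raw.decode("iso-8859-2"))

# No charset declared: bytes are taken as plain 8-bit characters (assumed
# Latin-1 here), so 0xB3 stays a single byte and the word is not split.
without_charset = split_on_non_latin1(raw.decode("latin-1"))

print(with_charset)
print(len(without_charset))
```

Under this model the first call yields 'ba', 'a', 'ajka', 'any', 'word', matching
the split seen with the charset declaration, while the second keeps the Polish
word as one token, matching the result without it.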



regards
Received on Fri Sep 13 13:57:10 2002