Re: Fw: Re: 8-bit chars

From: Bill Conlon <bill(at)>
Date: Sun Dec 14 2003 - 19:34:53 GMT

Let me try:

Two documents:

Document 1 Encoding A:  has the word 'cat' represented as numbers 123.
Document 2 Encoding B:  has the word 'dog' represented as numbers 123.

Both documents are spidered.  So the index has some pointers, and the 
pointer for the word represented as '123' points to both Document 1 and 
Document 2.

The index does not know the encoding.  So when I search for 'cat' I get 
two documents, even though only one contains the word with the meaning 
'cat'.
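
To make the collision concrete, here is a minimal Python sketch.  The
names (documents, build_index) are made up for illustration and this is
not how the real indexer is written; it only assumes the index maps a
raw stored value to the set of documents that contain it.

```python
from collections import defaultdict

# Hypothetical data: both documents store the raw value "123", but it
# means 'cat' under Encoding A and 'dog' under Encoding B.
documents = {
    "Document 1": ["123"],  # Encoding A: 123 means 'cat'
    "Document 2": ["123"],  # Encoding B: 123 means 'dog'
}

def build_index(docs):
    """Map each raw token to the set of documents containing it."""
    index = defaultdict(set)
    for name, tokens in docs.items():
        for token in tokens:
            index[token].add(name)
    return index

index = build_index(documents)

# A search for 'cat' is really a search for the raw value 123.
hits = sorted(index["123"])
print(hits)
```

Both documents come back, because nothing in the key distinguishes the
two encodings.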

What I think you are asking for is to add the encoding to the index, so 
instead of just a representation:
123 --> Document1, Document 2

you want
A123 --> Document 1
B123 --> Document 2

Now what do you do about Encoding C, where 'cat' is also represented as 
123?

C123 --> Document 3

Now I search for cat = A123, and only obtain Document 1, even though 
semantically, I want both Document 1 and Document 3.
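
The same toy model shows the problem with encoding-tagged keys.  Again
a sketch under assumptions: build_tagged_index and the three-document
data set are hypothetical, and the tag is simply prepended to the raw
value as in the A123/B123/C123 notation above.

```python
from collections import defaultdict

# Hypothetical data: document name -> (encoding tag, raw tokens).
# 123 means 'cat' under A and C, but 'dog' under B.
documents = {
    "Document 1": ("A", ["123"]),
    "Document 2": ("B", ["123"]),
    "Document 3": ("C", ["123"]),
}

def build_tagged_index(docs):
    """Key each entry by encoding tag + raw token, e.g. 'A123'."""
    index = defaultdict(set)
    for name, (encoding, tokens) in docs.items():
        for token in tokens:
            index[encoding + token].add(name)
    return index

tagged = build_tagged_index(documents)

# Searching for cat = A123 finds Document 1 only.
cat_hits = sorted(tagged["A123"])
print(cat_hits)
```

Document 3 sits under the separate key C123, so the search never sees
it unless something tells the index that A123 and C123 are the same
word.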

The index is useful because it captures 'meaning'.  How do you propose to 
build in a semantic parser so that the index can know that the word 'cat' 
is what is meant by the different encodings?  That is, how do we know that 
A123 is equivalent to C123, but different from B123?
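
For what it is worth, here is a sketch of what that equivalence test
would need, using real encodings in place of A, B and C.  The mapping
is my own choice for illustration: byte 0xE4 is 'ä' in both ISO-8859-1
and ISO-8859-15, but 'Д' in KOI8-R.  It assumes each document's
encoding is known at index time and that keys can be decoded into one
common character set, which is exactly the part in dispute.

```python
from collections import defaultdict

RAW = b"\xe4"  # the same stored byte in all three documents

# Hypothetical data: document name -> (declared encoding, raw bytes).
documents = {
    "Document 1": ("latin-1", RAW),     # plays the role of Encoding A
    "Document 2": ("koi8-r", RAW),      # plays the role of Encoding B
    "Document 3": ("iso8859-15", RAW),  # plays the role of Encoding C
}

def build_decoded_index(docs):
    """Key each entry by the decoded character, not the raw byte."""
    index = defaultdict(set)
    for name, (encoding, raw) in docs.items():
        index[raw.decode(encoding)].add(name)
    return index

decoded_index = build_decoded_index(documents)

# A and C decode to the same character; B decodes to a different one.
same_word = sorted(decoded_index["\u00e4"])   # 'ä'
other_word = sorted(decoded_index["\u0414"])  # 'Д'
print(same_word, other_word)
```

Without the declared encoding and a common target character set there
is nothing to compare, and the keys stay as ambiguous as before.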

>> There is NO WAY to store more than one encoding in the index as it is
>> currently designed.
>> And that's exactly what you are asking to do.  You want to have libxml2
>> convert the document back to its original encoding when storing the
>> words in the index -- "as-is" -- and that's trying to store more than
>> one encoding in the index at the same time.
>Yes, that is exactly what I am asking to do.
>Forget about encodings, you won't see the wider picture.
>Think how can we index documents presented in 3 different languages (without
>utf-8 support)? This is the only solution, and it works.

