Re: Fw: Re: 8-bit chars

From: Bill Moseley <moseley(at)>
Date: Fri Dec 12 2003 - 15:13:25 GMT
On Thu, Dec 11, 2003 at 11:51:08PM -0800, John Angel wrote:
> > > Why it wouldn't work with different encodings? 
> > 
> > All the data in the index has to be of the same encoding.
> > Can't very well index your windows-1250 F0 character (a "d" with a line
> > through it) and an 8859-1 F0 character ("eth, Icelandic") in the same
> > index.  They are different characters.
> What difference does it make? Both will be valid chars.

The above wasn't clear?  The index stores numbers, not characters. The
index does not retain the encoding of every word it indexes.  

Encodings map characters to numbers.  There's more than one mapping.
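
To see the two mappings side by side, here is a small Python sketch.  It
assumes Python's standard codec names, "cp1250" for windows-1250 and
"latin-1" for 8859-1, and decodes the single byte F0 under each one.

```python
# One byte, two encodings, two different characters.
raw = b"\xf0"

as_1250 = raw.decode("cp1250")     # "d" with a line through it (U+0111)
as_latin1 = raw.decode("latin-1")  # eth (U+00F0)

# Print the code points so the result does not depend on terminal fonts.
print(hex(ord(as_1250)), hex(ord(as_latin1)))
print(as_1250 == as_latin1)
```

The byte is identical in both cases; only the mapping applied to it
decides which character you get back.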

Think of your suggestion.  One document is 1250 and it includes a word
with the "d"-slash character.  That word gets indexed -- since the index
stores numbers (not characters) that stored word includes the F0 byte. 
The next document is in 8859-1 and it includes some word with the "eth"
character (it's an Icelandic document, I suppose) and that gets indexed,
and again there's a word that includes byte F0 in the index.  

Now you have a value in the index, "F0", that represents more than one
character.  So when searching, are you looking for a 1250 char or an
8859-1 char?  You can't tell.
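
The collision can be sketched with a toy index in Python.  This is only an
illustration, not how any real indexer stores its data: the `index` dict,
the `add_document` helper, the document names, and the two made-up words
are all hypothetical.  The only thing it shares with the case above is
that the index keeps the raw bytes and forgets the encoding.

```python
from collections import defaultdict

# Hypothetical toy index: raw word bytes -> set of document ids.
# The encoding of each document is not stored anywhere.
index = defaultdict(set)

def add_document(doc_id, text, encoding):
    """Split text into words and store each word as raw bytes."""
    for word in text.split():
        index[word.encode(encoding)].add(doc_id)

# Two made-up words that differ only in their first character.
add_document("doc-1250", "\u0111ak", "cp1250")     # d-slash + "ak"
add_document("doc-8859-1", "\u00f0ak", "latin-1")  # eth + "ak"

# Both words were stored under the same bytes, F0 61 6B.
hits = index[b"\xf0ak"]
print(len(index), sorted(hits))
```

Two different words end up as one index entry, so a search for either one
returns both documents and nothing in the index can tell them apart.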

If that's not clear, pry up the keys on your keyboard and relocate 
them.  Then, looking at your keyboard, try entering some topic 
in your search engine.

Bill Moseley
Received on Fri Dec 12 15:14:06 2003