Re: Fw: Re: 8-bit chars

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sun Dec 14 2003 - 21:05:38 GMT
On Sun, Dec 14, 2003 at 12:05:10PM -0800, John Angel wrote:
> You are right, but your examples are theory.
> 
> In practice, there are no cats and dogs both represented with 123.


No, Bill Conlon's example is right on.  I suggest you read it again.  I 
don't think he was being literal about "cat" and "dog" both being 
represented with 123.  Did you think he was?

> What is the alternative to the proposed solution, since we don't have utf-8
> yet?

For what you are asking?  There is none.  The solution IS utf-8.  That's
why there is utf-8 (and Unicode) after all, to solve just this problem.

Maybe read this too:

  http://www.unicode.org/standard/WhatIsUnicode.html

or one of the other many, many sites that explain this.
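
To make Bill Conlon's point concrete, here is a minimal sketch in Python
(not swish-e code).  The document names, the sample bytes, and the
build_index() helper are all made up for illustration.  It assumes the
encoding of each document is known at index time, which is what the
parser has to work out before it can convert anything.  Byte 0xE4 is one
letter in ISO-8859-1 and a different letter in Windows-1251, so an index
keyed on raw bytes merges two unrelated words, while an index keyed on
the decoded Unicode text does not:

```python
from collections import defaultdict

# Hypothetical documents: raw bytes plus the encoding each was written in.
docs = {
    "doc1": (b"\xe4", "latin-1"),            # the letter 'ä' in ISO-8859-1
    "doc2": (b"\xe4", "cp1251"),             # the letter 'д' in Windows-1251
    "doc3": ("ä".encode("utf-8"), "utf-8"),  # same letter as doc1, other bytes
}

def build_index(docs, decode):
    """Map each 'word' to the set of documents that contain it.

    decode=False keys the index on raw bytes ("as-is").
    decode=True converts to Unicode first, using the known encoding.
    """
    index = defaultdict(set)
    for name, (raw, encoding) in docs.items():
        key = raw.decode(encoding) if decode else raw
        index[key].add(name)
    return index

byte_index = build_index(docs, decode=False)
unicode_index = build_index(docs, decode=True)

# Raw bytes: doc1 and doc2 collide, and doc3 is missed entirely.
print(sorted(byte_index[b"\xe4"]))

# Unicode: doc1 and doc3 are found together, doc2 is kept separate.
print(sorted(unicode_index["ä"]))
print(sorted(unicode_index["д"]))
```

Tagging each key with its encoding (A123, B123, C123) would separate
doc1 from doc2, but it would also separate doc1 from doc3, which is the
"Encoding C" problem in the message quoted below.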


> 
> 
> ----- Original Message ----- 
> From: "Bill Conlon" <bill@tothept.com>
> To: "Multiple recipients of list" <swish-e@sunsite.berkeley.edu>
> Sent: Sunday, December 14, 2003 20:34
> Subject: [SWISH-E] Re: Fw: Re: 8-bit chars
> 
> 
> > Let me try:
> >
> > Two documents:
> >
> > Document 1 Encoding A:  has the word 'cat' represented as numbers 123.
> > Document 2 Encoding B:  has the word 'dog' represented as numbers 123.
> >
> > Both documents are spidered.  So the index has some pointers, and the
> > pointer for the word represented as '123' points to both Document 1 and
> > Document 2.
> >
> > The index does not know the encoding.  So when I search for 'cat' I get
> > two documents, even though only one contains the word with the meaning
> > 'cat'.
> >
> > What I think you are asking for is to add the encoding to the index, so
> > instead of just a representation:
> > 123 --> Document 1, Document 2
> >
> > you want
> > A123 --> Document 1
> > B123 --> Document 2
> >
> > Now what do you do about Encoding C, where 'cat' is also represented as
> > 123?
> >
> > C123 --> Document 3
> >
> > Now I search for cat = A123, and only obtain Document 1, even though
> > semantically, I want both Document 1 and Document 3.
> >
> > The index is useful because it captures 'meaning'.  How do you propose to
> > build in a semantic parser so that the index can know the word 'cat' is
> > what is meant by different encodings?  That is, how do we know that
> >
> > A123 is equivalent to C123, but is different from B123?
> >
> > >> There is NO WAY to store more than one encoding in the index as it is
> > >> currently designed.
> > >>
> > >> And that's exactly what you are asking to do.  You want to have libxml2
> > >> convert the document back to its original encoding when storing the
> > >> words in the index -- "as-is" -- and that's trying to store more than
> > >> one encoding in the index at the same time.
> > >
> > >
> > >Yes, that is exactly what I am asking to do.
> > >
> > >Forget about encodings, you won't see the wider picture.
> > >
> > >Think how can we index documents presented in 3 different languages
> > >(without utf-8 support)? This is the only solution, and it works.
> > >
> >
> >
> > Bill Conlon
> >
> > To the Point
> > 345 California Avenue Suite 2
> > Palo Alto, CA 94306
> >
> > office: 650.327.2175
> > fax:    650.329.8335
> > mobile: 650.906.9929
> > e-mail: mailto:bill@tothept.com
> > web:    http://www.tothept.com
> >
> >
> >
> 

-- 
Bill Moseley
moseley@hank.org
Received on Sun Dec 14 21:05:47 2003