Re: Fw: Re: 8-bit chars

From: John Angel <angel_john(at)not-real.hotmail.com>
Date: Sun Dec 14 2003 - 21:50:43 GMT
We agreed that utf-8 is the right thing, but who knows when it will be
implemented.

I repeat the question - what is the alternative until utf-8 support is
implemented? You don't have one. The proposed solution is something that can
be used in the meantime.


----- Original Message ----- 
From: "Bill Moseley" <moseley@hank.org>
To: "John Angel" <angel_john@hotmail.com>
Cc: "Multiple recipients of list" <swish-e@sunsite.berkeley.edu>
Sent: Sunday, December 14, 2003 22:05
Subject: Re: [SWISH-E] Re: Fw: Re: 8-bit chars


> On Sun, Dec 14, 2003 at 12:05:10PM -0800, John Angel wrote:
> > You are right, but your examples are theory.
> >
> > In practice, there are no cats and dogs both represented with 123.
>
>
> No, Bill Conlon's example is right on.  I suggest you read it again.  I
> don't think he was being literal about "cat" and "dog" represented with
> 123.  Did you?
>
> > What is the alternative to the proposed solution, since we don't have
> > utf-8 yet?
>
> For what you are asking?  There is none.  The solution IS utf-8.  That's
> why there is utf-8 (and Unicode) after all, to solve just this problem.
>
> Maybe read this too:
>
>   http://www.unicode.org/standard/WhatIsUnicode.html
>
> or one of the other many, many sites that explain this.
>
>
> >
> >
> > ----- Original Message ----- 
> > From: "Bill Conlon" <bill@tothept.com>
> > To: "Multiple recipients of list" <swish-e@sunsite.berkeley.edu>
> > Sent: Sunday, December 14, 2003 20:34
> > Subject: [SWISH-E] Re: Fw: Re: 8-bit chars
> >
> >
> > > Let me try:
> > >
> > > Two documents:
> > >
> > > Document 1 Encoding A:  has the word 'cat' represented as numbers 123.
> > > Document 2 Encoding B:  has the word 'dog' represented as numbers 123.
> > >
> > > Both documents are spidered.  So the index has some pointers, and the
> > > pointer for the word represented as '123' points to both Document 1
> > > and Document 2.
> > >
> > > The index does not know the encoding.  So when I search for 'cat' I
> > > get two documents, even though only one contains the word with the
> > > meaning 'cat'.
> > >
> > > What I think you are asking for is to add the encoding to the index,
> > > so instead of just a representation:
> > > 123 --> Document1, Document 2
> > >
> > > you want
> > > A123 --> Document 1
> > > B123 --> Document 2
> > >
> > > Now what do you do about Encoding C, where 'cat' is also represented
> > > as 123?
> > >
> > > C123 --> Document 3
> > >
> > > Now I search for cat = A123, and only obtain Document 1, even though
> > > semantically, I want both Document 1 and Document 3.
> > >
> > > The index is useful because it captures 'meaning'.  How do you propose
> > > to build in a semantic parser so that the index can know the word 'cat'
> > > is what is meant by different encodings?  That is, how do we know that
> > >
> > > A123 is equivalent to C123, but is different from B123?
> > >
> > > >> There is NO WAY to store more than one encoding in the index as it
> > > >> is currently designed.
> > > >>
> > > >> And that's exactly what you are asking to do.  You want to have
> > > >> libxml2 convert the document back to its original encoding when
> > > >> storing the words in the index -- "as-is" -- and that's trying to
> > > >> store more than one encoding in the index at the same time.
> > > >
> > > >
> > > >Yes, that is exactly what I am asking to do.
> > > >
> > > >Forget about encodings, you won't see the wider picture.
> > > >
> > > >Think how can we index documents presented in 3 different languages
> > > >(without utf-8 support)? This is the only solution, and it works.
> > > >
> > >
> > >
> > > Bill Conlon
> > >
> > > To the Point
> > > 345 California Avenue Suite 2
> > > Palo Alto, CA 94306
> > >
> > > office: 650.327.2175
> > > fax:    650.329.8335
> > > mobile: 650.906.9929
> > > e-mail: mailto:bill@tothept.com
> > > web:    http://www.tothept.com
> > >
> > >
> > >
> >
>
> -- 
> Bill Moseley
> moseley@hank.org
>
>
Received on Sun Dec 14 21:50:49 2003
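
Bill Conlon's A123 / B123 / C123 example in the thread above can be
reproduced with real 8-bit encodings. What follows is only a minimal sketch
in Python, not Swish-e code: the document ids, the DOCS table and the
build_index helper are hypothetical names made up for illustration, and the
three encodings (latin-1, iso-8859-5, cp1252) are just convenient stand-ins
for Encodings A, B and C. It builds the same tiny index three ways: keyed by
raw bytes, keyed by (encoding, bytes), and keyed by the decoded Unicode text.

```python
from collections import defaultdict

RAW = b"caf\xe9"  # the same four bytes are stored in all three documents

# Hypothetical corpus: document id -> (declared encoding, raw bytes).
DOCS = {
    "doc1": ("latin-1", RAW),     # Encoding A
    "doc2": ("iso-8859-5", RAW),  # Encoding B: same bytes, different word
    "doc3": ("cp1252", RAW),      # Encoding C: same bytes, same word as A
}


def build_index(key_for):
    """Map each index key to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, (encoding, raw) in DOCS.items():
        index[key_for(encoding, raw)].add(doc_id)
    return dict(index)


# 1. Bytes stored "as-is": the index cannot tell the encodings apart.
by_bytes = build_index(lambda enc, raw: raw)

# 2. Encoding stored next to the bytes (the A123 / B123 / C123 idea).
by_tagged = build_index(lambda enc, raw: (enc, raw))

# 3. Everything decoded to Unicode before indexing.
by_text = build_index(lambda enc, raw: raw.decode(enc))

query = RAW.decode("latin-1")  # the word as the searcher means it

print(sorted(by_bytes[RAW]))                 # doc2 is a false hit
print(sorted(by_tagged[("latin-1", RAW)]))   # doc3 is missed
print(sorted(by_text[query]))                # doc1 and doc3 only
```

Under these assumptions the raw-byte index returns all three documents, the
encoding-tagged index returns only doc1, and only the index built on decoded
text groups doc1 with doc3 while keeping doc2 separate. That is the behaviour
the replies above attribute to a single common encoding such as utf-8.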