Re: Fw: Re: 8-bit chars

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sun Dec 14 2003 - 21:24:45 GMT
On Sun, Dec 14, 2003 at 11:15:02AM -0800, Frances Coakley wrote:
> 
> > > There is NO WAY to store more than one encoding in the index as it is
> > > currently designed.
> 
> Doesn't the meta charset give you the encoding used in the original
> document?  Assuming the 8-bit chars are the more unusual chars, it is
> possible that a word in the Icelandic charset maps onto the same sequence
> of 8-bit chars as a different word in the Norse charset.  But if the
> searcher is viewing with the Icelandic charset set, then searching for
> Meta Charset=Icelandic and word=whatever will find the Icelandic word.
> Pages not encoded in the Icelandic charset cannot, by definition,
> contain that char.
> Or have I misunderstood the problem?

Yes, that would work because you are not mixing encodings in the 
same index.  John's suggestion was to index "as-is" which would mix 
encodings.  
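
To make the "mixing encodings" problem concrete, here is a small Python
sketch.  The two charsets (ISO-8859-1 and ISO-8859-5) are only my choice
for illustration, they are not the ones from this thread, and none of this
is swish code.  It just shows that one stored byte sequence is a different
word depending on which charset the reader assumes:

```python
# Illustration only: the same raw 8-bit bytes, read under two charsets.
# ISO-8859-1 and ISO-8859-5 are assumed here purely as examples.
raw = b"b\xe4r"  # what an index storing bytes "as-is" would hold

as_latin1 = raw.decode("iso-8859-1")    # byte 0xE4 is a Latin letter here
as_cyrillic = raw.decode("iso-8859-5")  # byte 0xE4 is a Cyrillic letter here

# Both decodes succeed, and both round-trip to the identical bytes,
# so the index alone cannot tell which word was meant.
same_bytes = (
    as_latin1.encode("iso-8859-1") == as_cyrillic.encode("iso-8859-5")
)

if __name__ == "__main__":
    print(repr(as_latin1), repr(as_cyrillic), same_bytes)
```

A search for either word would match the same index entry, which is why
the charset has to be known per index (or per document) for the result to
mean anything.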

Since metanames are subsets of documents (with the exception of some of
the true metadata, like dates or pathnames), you would need a complete
duplicate set of metanames for each encoding found.  It would probably be
easier to design a system that selects an index file based on character
encoding.  But that's still limited to 8-bit character sets, so UTF-8
is where the effort should go.
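
A rough sketch of the "select an index file by encoding" idea, in Python.
Everything here is hypothetical: the function name, the index file names,
and the mapping are made up for illustration, and nothing like this exists
in swish as far as I know.  The only real call is codecs.lookup(), which
is used to fold charset aliases ("latin1", "ISO-8859-1") to one name:

```python
import codecs

# Hypothetical mapping from a normalized charset name to an index file.
# The file names are invented for this example.
INDEX_FILES = {
    "iso8859-1": "index.latin1.swish-e",
    "iso8859-5": "index.cyrillic.swish-e",
}


def index_for_charset(charset: str) -> str:
    """Return the index file to search for the searcher's charset.

    Raises LookupError if the charset is unknown to Python or has no
    index configured, rather than silently searching the wrong index.
    """
    canonical = codecs.lookup(charset).name  # normalizes aliases
    try:
        return INDEX_FILES[canonical]
    except KeyError:
        raise LookupError(f"no index configured for charset {charset!r}")


if __name__ == "__main__":
    print(index_for_charset("latin1"))
    print(index_for_charset("ISO-8859-5"))
```

The front end would call something like this with the charset the browser
sent, then pass the returned file name to the search.  Each index then
holds exactly one 8-bit encoding, which is the limitation noted above.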

Some features in swish are based on characters being 8 bits wide.  I think
the wildcard feature (foo*) uses a 256-entry lookup table, but I can't
remember for sure.
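
To show why a 256-entry table assumes 8-bit characters, here is a toy
version in Python.  This is NOT how swish implements wildcards (as I said,
I don't remember the details); the function names and the bucketing scheme
are assumptions made up for the example.  Words are bucketed by their
first byte, and a prefix search only scans the matching bucket:

```python
def build_first_byte_table(words, encoding):
    """Bucket non-empty words into 256 slots keyed by their first byte."""
    table = [[] for _ in range(256)]
    for word in words:
        first_byte = word.encode(encoding)[0]
        table[first_byte].append(word)
    return table


def prefix_search(table, prefix, encoding):
    """Return words starting with a non-empty prefix (the 'foo' in foo*)."""
    raw = prefix.encode(encoding)
    bucket = table[raw[0]]  # only one of the 256 buckets is scanned
    return [w for w in bucket if w.encode(encoding).startswith(raw)]


WORDS = ["ær", "øl", "foo", "food"]
latin_table = build_first_byte_table(WORDS, "latin-1")
utf8_table = build_first_byte_table(WORDS, "utf-8")

if __name__ == "__main__":
    print(prefix_search(latin_table, "foo", "latin-1"))
    # In Latin-1 each character is one byte, so "ær" and "øl" get their
    # own buckets.  In UTF-8 both start with the lead byte 0xC3 and end
    # up sharing a bucket: one table slot no longer means one character.
    print(latin_table[0xE6], latin_table[0xF8], utf8_table[0xC3])
```

With an 8-bit charset the table index is the character itself.  With UTF-8
the index is only the first byte of a multi-byte sequence, so any code
that treats a slot as "one character" would need rework.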

-- 
Bill Moseley
moseley@hank.org
Received on Sun Dec 14 21:24:53 2003