Re: Indexing UTF-8 IIS Pages

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Aug 04 2004 - 17:38:20 GMT
On Wed, Aug 04, 2004 at 05:58:37PM +0200, Mammitzsch.T@zdf.de wrote:
> > 
> > 
> > On Wed, Aug 04, 2004 at 04:50:46AM -0700, Mammitzsch.T@zdf.de wrote:
> > > Hi everybody,
> > > 
> > > i try to spider an IIS 6.0 which delivers pages with utf-8 in the
> > > http-header. As far as i understood the manual, swish-e converts utf-8 to
> > > iso-8859-1 if i use libxml2 (html2-parser). Unfortunately special chars like
> > > german umlauts are not recognized if i search through the swish.cgi
> > > frontend. Also results with umlauts are not displayed correctly. swish-e
> > > runs on a sun e450 with solaris 5.8. Any ideas?
> > 
> > Basically what Peter said.  One thing you should try is while indexing
> > and spidering (a few small test files) use the options 
> > 
> >     -T parsed_words indexed_words 
> > 
> > which will show you what white-space separated words are being fed to
> > swish and how they are converted into words stored in the index (via
> > WordCharacters setting).
> > 
> ok, indexer did e.g. 
> 
> White-space found word 'Saarbrucken'
>     Adding:[648:swishdefault(1)]   'saarbrucken'   Pos:397  Stuct:0x9 ( BODY FILE )
> 
> looks good for me, but searching for  saarbrucken returns lots of results
> where "saarbrucken" is not included.

So the umlauts got stripped along the way.  But, you can see them when
indexing, right?

    moseley@bumby:~$ cat txt
    saarbrücken

    moseley@bumby:~$ swish-e -i txt -v0 -T indexed_words
        Adding:[1:swishdefault(1)]   'saarbrücken'   Pos:5  Stuct:0x9 ( BODY FILE )

    moseley@bumby:~$ swish-e -w saarbrücken -H0
    1000 txt "txt" 13

    moseley@bumby:~$ swish-e -w saarbrücken -H9 | grep Parsed
    # Parsed Words: saarbrücke

> other words with umlauts return no results (except 1 pdf which i found).

Well, the -T option shows what text is placed in the index.  And
"Parsed Words" shows what words are searched for in the index.  Those
two things will help you figure out why you can or cannot search for
some text.  
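
A rough way to line those two outputs up, as a sketch only: the Python
below pulls the stored word out of an "Adding:" line and the terms out of
a "# Parsed Words:" header line, then lists query terms that never appeared
in the indexer output. The helper names are made up for illustration, the
"Adding:" sample is copied from the quoted output above, and the "Parsed
Words" sample is a hypothetical line, not real output. It assumes the
header holds plain terms with no search operators.

```python
import re

# Quoted word in a "-T indexed_words" line, e.g.
#   Adding:[648:swishdefault(1)]   'saarbrucken'   Pos:397 ...
ADDING_RE = re.compile(r"Adding:\[[^\]]*\]\s+'([^']*)'")
# Terms in the "# Parsed Words:" header line printed with -H9
PARSED_RE = re.compile(r"#\s*Parsed Words:\s*(.*)")


def indexed_words(lines):
    """Return the words that the indexer reported adding."""
    found = []
    for line in lines:
        m = ADDING_RE.search(line)
        if m:
            found.append(m.group(1))
    return found


def parsed_words(lines):
    """Return the terms from the first 'Parsed Words' header, if any."""
    for line in lines:
        m = PARSED_RE.search(line)
        if m:
            return m.group(1).split()
    return []


def missing_from_index(indexed, parsed):
    """Query terms that never showed up in the indexer output."""
    known = set(indexed)
    return [term for term in parsed if term not in known]


index_log = [
    "    Adding:[648:swishdefault(1)]   'saarbrucken'   Pos:397  Stuct:0x9 ( BODY FILE )",
]
# Hypothetical header line, made up for illustration (assumption)
search_log = ["# Parsed Words: saarbrücken"]

print(missing_from_index(indexed_words(index_log), parsed_words(search_log)))
```

With these sample lines the query term with the umlaut is reported as
missing, because the index only holds the stripped form.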

I suppose there could be an encoding issue, but even if there were, I
would not expect it to matter unless you are somehow using different
encodings when indexing and when searching.  I've never seen that happen.
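
To illustrate why a mismatch between the two encodings would matter, here
is a minimal Python sketch. It says nothing about swish-e internals; it
only shows that the same word is a different byte sequence in UTF-8 and in
ISO-8859-1, so a byte-wise comparison between a term stored in one encoding
and a query sent in the other cannot match. The function name same_term is
made up for illustration.

```python
word = "saarbrücken"

# The umlaut is two bytes (0xC3 0xBC) in UTF-8 ...
utf8_bytes = word.encode("utf-8")
# ... and a single byte (0xFC) in ISO-8859-1.
latin1_bytes = word.encode("iso-8859-1")


def same_term(indexed: bytes, query: bytes) -> bool:
    """Byte-wise comparison, standing in for an exact term lookup."""
    return indexed == query


# Same encoding on both sides: the term matches.
print(same_term(latin1_bytes, word.encode("iso-8859-1")))
# Index in ISO-8859-1, query in UTF-8: no match.
print(same_term(latin1_bytes, utf8_bytes))
# UTF-8 bytes read as ISO-8859-1 show the usual garbled umlaut.
print(utf8_bytes.decode("iso-8859-1"))
```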


> why isn't it working when searching?

That's something you need to answer by using the debugging options.

-- 
Bill Moseley
moseley@hank.org

Received on Wed Aug 4 10:38:50 2004