RE: AW: Indexing UTF-8 IIS Pages

From: <Mammitzsch.T(at)not-real.zdf.de>
Date: Wed Aug 04 2004 - 16:27:52 GMT
> > 
> > On Wed, Aug 04, 2004 at 04:50:46AM -0700, Mammitzsch.T@zdf.de wrote:
> > > Hi everybody,
> > > 
> > > i try to spider an IIS 6.0 which delivers pages with utf-8 in the
> > > http-header. As far as i understood the manual, swish-e converts utf-8 to
> > > iso-8859-1 if i use libxml2 (html2-parser). Unfortunately special chars like
> > > german umlauts are not recognized if i search through the swish.cgi
> > > frontend. Also results with umlauts are not displayed correctly. swish-e
> > > runs on a sun e450 with solaris 5.8. Any ideas?
> > 
> > Basically what Peter said.  One thing you should try is while indexing
> > and spidering (a few small test files) use the options 
> > 
> >     -T parsed_words indexed_words 
> > 
> > which will show you what white-space separated words are being fed to
> > swish and how they are converted into words stored in the index (via
> > WordCharacters setting).
> > 
> ok, indexer did e.g. 
> 
> White-space found word 'Saarbrucken'
>     Adding:[648:swishdefault(1)]   'saarbrucken'   Pos:397  Stuct:0x9 ( BODY FILE )
> 
> looks good for me, but searching for  saarbrucken returns lots of results
> where "saarbrucken" is not included.
> other words with umlauts return no results (except 1 pdf which i found).
> 
> why isn't it working when searching?
> 
> bye, Thomas Mammitzsch

hmm, the umlauts were stripped out of my post. i originally wrote saarbrucken
with a "u" with two dots above (german umlaut).
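
To make the encoding issue concrete, here is a minimal Python sketch of how the same word looks as UTF-8 bytes, as ISO-8859-1 bytes, and after the umlaut is stripped. It is only an illustration of the byte-level difference and not a diagnosis of what swish-e or swish.cgi does internally; the helper name strip_marks is made up for this example.

```python
import unicodedata

# The word as originally typed, with u-umlaut (U+00FC).
word = "Saarbr\u00fccken"

# The same word in the two encodings mentioned in the thread.
utf8_bytes = word.encode("utf-8")          # umlaut becomes two bytes: C3 BC
latin1_bytes = word.encode("iso-8859-1")   # umlaut is the single byte FC


def strip_marks(text):
    """Hypothetical helper: drop combining marks so that u-umlaut becomes plain u."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# What a reader sees if UTF-8 bytes are wrongly interpreted as ISO-8859-1.
mojibake = utf8_bytes.decode("iso-8859-1")

# What is left once the umlaut is stripped, as happened to this post.
ascii_folded = strip_marks(word)

# ascii() keeps the output printable on terminals without UTF-8 support.
print(utf8_bytes)
print(latin1_bytes)
print(ascii(mojibake))
print(ascii_folded)
```

If the indexed words and the search query end up in different forms from this list, a byte-wise comparison cannot match them, so it may be worth checking which form the search frontend actually passes on.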

bye, Thomas Mammitzsch
Received on Wed Aug 4 09:28:04 2004