Re: Handling of HTML entities

From: Peter Karman <karman(at)>
Date: Thu Mar 25 2004 - 16:10:38 GMT
I think this goes back to the Unicode vs. ISO-8859 problem that keeps 
recurring (someone correct me if I'm wrong).

Your Unicode character entities are being converted to ISO-8859 and 
stored that way in the index, as (I think) whitespace. So searching on 
the entities won't work.
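
To illustrate why the conversion loses these characters, here is a 
minimal Python sketch. It is not Swish-e code, the helper names are made 
up for the example, and it assumes the target encoding is ISO-8859-1 
(Latin-1). It only shows that the decoded entities fall outside that 
encoding; what the indexer actually stores in their place is my guess 
above, not something this demonstrates.

```python
import html


def decode_entities(text: str) -> str:
    """Turn HTML character references (e.g. &#12400;) into real characters."""
    return html.unescape(text)


def fits_latin1(text: str) -> bool:
    """Return True if every character can be encoded as ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# The search string from the original message.
query = "&#12400;&#12435;&#12372;&#12399;&#12435;"
decoded = decode_entities(query)

# Code points of the decoded characters; ISO-8859-1 only covers 0..255.
code_points = [ord(ch) for ch in decoded]
print(code_points)

# The Japanese characters have no ISO-8859-1 representation.
print(fits_latin1(decoded))

# By contrast, a Latin-1 entity such as &eacute; survives the conversion.
print(fits_latin1(decode_entities("caf&eacute;")))
```

Every code point in the query is far above 255, so an 8-bit ISO-8859 
index has nothing to store for it.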

The ConvertHTMLEntities config option only applies when *not* using 
libxml2 as your parser, so you're out of luck that way too.

Not good news, if I am understanding this correctly.

Search the email archives for more on the Unicode thread. As I 
understand it, supporting Unicode would take a major rewrite, and no 
one has stepped forward to say "me! me! I'll do it!".


Pieter Claerhout supposedly wrote on 03/25/2004 09:17 AM:
> Hi all,
> I recently started using Swish-E for indexing some HTML content. The
> indexing works just fine, but I'm still struggling with the search part
> using the command line.
> In the HTML I index, there are a lot of HTML entities embedded. So far, no
> problem as everything indexes just fine.
> However, if I want to do a search, the command line doesn't accept HTML
> entities in the search string, but requires the original Unicode characters.
> Is there a way to have it accept HTML entities for searching?
> An example:
> The document that gets indexed looks as follows:
> <html>
> <head>
>     <title>beInformed 1.0</title>
>     <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
> </head>
> <body>
>     <p>&#12400;&#12435;&#12372;&#12399;&#12435;</p>
> </body>
> </html>
> The search command I tried is as follows:
> C:\>swish-e -w "&#12400;&#12435;&#12372;&#12399;&#12435;"
> # SWISH format: 2.4.1
> # Search words: &#12400;&#12435;&#12372;&#12399;&#12435;
> # Removed stopwords:
> err: no results
> .
> Is there a way to make this work? I don't want to use the native characters
> in the command line (they are Japanese)...
> Thanks in advance,
> pieter

Peter Karman - Software Publications Programmer - Cray Inc
phone: 651-605-9009
Received on Thu Mar 25 08:10:39 2004