Re: Handling of HTML entities

From: Pieter Claerhout <Pieter.Claerhout(at)>
Date: Thu Mar 25 2004 - 16:25:12 GMT
This is bad news for me, as it makes it impossible for me to use Swish-E in
my project.



-----Original Message-----
From: [] On
Behalf Of Peter Karman
Sent: 25 March 2004 17:10
To: Multiple recipients of list
Subject: [SWISH-E] Re: Handling of HTML entities

I think this harkens back to the Unicode vs ISO-8859 debacle that seems 
to recur (someone correct me here).

Your Unicode character entities are being converted to 8859 and stored 
that way, as (I think) whitespace, in the index. So searching on 
entities won't work.
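
To illustrate the character-set limitation described above, here is a minimal Python sketch. It only shows what the numeric entities from the example decode to and that those code points have no ISO-8859-1 representation; it is not Swish-E code and makes no claim about how Swish-E handles the conversion internally. The helper names decode_entities and fits_latin1 are made up for this illustration.

```python
import html


def decode_entities(text: str) -> str:
    """Turn HTML character references (e.g. &#12400;) into Unicode characters."""
    return html.unescape(text)


def fits_latin1(text: str) -> bool:
    """Return True if every character can be encoded as ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# The search string from the original question.
query = "&#12400;&#12435;&#12372;&#12399;&#12435;"
decoded = decode_entities(query)

# Show the code points behind the entities (all in the Hiragana block).
print([hex(ord(ch)) for ch in decoded])

# Hiragana lies outside ISO-8859-1, so an 8-bit Latin-1 store cannot hold it.
print(fits_latin1(decoded))

# For comparison, a Western European entity does fit.
print(fits_latin1(decode_entities("caf&eacute;")))
```

The comparison case shows why entities for Western European characters can survive an 8-bit conversion while the Japanese ones cannot.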

The ConvertHTMLEntities config option only applies when *not* using 
libxml2 as your parser, so you're out of luck that way too.

Not good news, if I am understanding this correctly.

Search the email archives for more on the Unicode thread. As I 
understand it, it's a major re-write to support and no one has stepped 
forward to say "me! me! I'll do it!".


Pieter Claerhout supposedly wrote on 03/25/2004 09:17 AM:
> Hi all,
> I recently started using Swish-E for indexing some HTML content. The
> indexing works just fine, but I'm still struggling with the search part
> using the command line.
> In the HTML I index, there are a lot of HTML entities embedded. So far, no
> problem as everything indexes just fine.
> However, if I want to do a search, the command line doesn't accept HTML
> entities in the search string, but requires the original Unicode characters.
> Is there a way to have it accept HTML entities for searching?
> An example:
> The document that gets indexed looks as follows:
> <html>
> <head>
>     <title>beInformed 1.0</title>
>     <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
> </head>
> <body>
>     <p>&#12400;&#12435;&#12372;&#12399;&#12435;</p>
> </body>
> </html>
> The search command I tried is as follows:
> C:\>swish-e -w "&#12400;&#12435;&#12372;&#12399;&#12435;"
> # SWISH format: 2.4.1
> # Search words: &#12400;&#12435;&#12372;&#12399;&#12435;
> # Removed stopwords:
> err: no results
> .
> Is there a way to make this work? I don't want to use the native characters
> in the command line (they are Japanese)...
> Thanks in advance,
> pieter

Peter Karman - Software Publications Programmer - Cray Inc
phone: 651-605-9009 -
Received on Thu Mar 25 08:25:12 2004