Handling of HTML entities

From: Pieter Claerhout <Pieter.Claerhout(at)not-real.Creo.com>
Date: Thu Mar 25 2004 - 15:17:52 GMT
Hi all,

I recently started using Swish-E to index some HTML content. The
indexing works just fine, but I'm still struggling with searching
from the command line.

In the HTML I index, there are a lot of HTML entities embedded. So far, no
problem as everything indexes just fine.

However, when I run a search, the command line doesn't accept HTML
entities in the search string; it requires the original Unicode characters.
Is there a way to have it accept HTML entities for searching?

An example:

The document that gets indexed looks as follows:

<html>
<head>
    <title>beInformed 1.0</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
    <p>&#12400;&#12435;&#12372;&#12399;&#12435;</p>
</body>
</html>

The search command I tried is as follows:

C:\>swish-e -w "&#12400;&#12435;&#12372;&#12399;&#12435;"
# SWISH format: 2.4.1
# Search words: &#12400;&#12435;&#12372;&#12399;&#12435;
# Removed stopwords:
err: no results
.

Is there a way to make this work? I don't want to use the native characters
in the command line (they are Japanese)...
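
One possible workaround, sketched here in Python rather than taken from
Swish-E itself: keep the entities in the query I type, and let a small wrapper
decode them into the actual characters just before calling swish-e. The
function names below are made up for illustration, and the sketch assumes
swish-e is on the PATH and that the index really contains the decoded
characters in a form the search can match, which I have not confirmed.

```python
import html
import subprocess


def build_search_command(query, swish_binary="swish-e"):
    """Return the argument list for a swish-e search.

    Hypothetical helper: numeric and named HTML entities in the query
    are decoded to the characters they stand for before the search.
    """
    words = html.unescape(query)  # "&#12400;" becomes the character U+3070
    return [swish_binary, "-w", words]


def run_search(query):
    """Run the search and return swish-e's standard output as text.

    Assumption: swish-e is installed and the default index file exists
    in the current directory.
    """
    result = subprocess.run(
        build_search_command(query),
        capture_output=True,
        text=True,
    )
    return result.stdout
```

With this, run_search("&#12400;&#12435;&#12372;&#12399;&#12435;") would pass
the five decoded Japanese characters to swish-e -w, so the command line I type
stays pure ASCII. Whether the search then finds the document still depends on
how the index stores those characters.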

Thanks in advance,


pieter
Received on Thu Mar 25 07:17:52 2004