Re: [swish-e] problem with encodings

From: Peter Karman <peter(at)>
Date: Fri Nov 09 2007 - 22:05:16 GMT

On 11/09/2007 03:51 PM, Bill Moseley wrote:
> On Thu, Nov 08, 2007 at 10:24:29PM -0600, Peter Karman wrote:
>>> I'm new to Swish-e and I think it's a great tool. Unfortunately I ran into a
>>> little problem. I indexed a collection of xml files which are encoded in
>>> Windows-1251. Then I wrote a small cgi script and I started sending queries to
>>> Swish-e. All was great except one thing. A pretty normal Windows-1251
>>> character '?' is treated by Swish-e as a word delimiter, but it isn't one. I'll
>>> appreciate any help.
>>> I'm on Windows XP SP 2 and I have installed Swish-e 2.4.5.
>>> Best regards,
>>> Nikola
>> You likely need to adjust your WordCharacters setting to include the relevant
>> 1251 characters. By default it is Latin1 (iso-8859-1).
> Peter, is there a way to tell libxml2 that the content is 1251?

iirc, libxml2 looks at the content-type header if it is html. If xml, it uses
the encoding given in the <?xml ...?> declaration.
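
To check what a given file actually declares, here is a rough Python sketch
that pulls the encoding out of the XML declaration. It is not how libxml2 does
it internally; the function name declared_encoding is made up for illustration,
and it assumes an ASCII-compatible encoding (it would miss UTF-16 input).

```python
import re

# Matches an XML declaration at the very start of the document and
# captures the value of its encoding pseudo-attribute, if present.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(data: bytes):
    """Return the encoding named in the XML declaration, or None."""
    match = _XML_DECL.match(data[:200])  # the declaration must come first
    return match.group(1).decode("ascii") if match else None

if __name__ == "__main__":
    sample = b'<?xml version="1.0" encoding="windows-1251"?><doc/>'
    print(declared_encoding(sample))
```

If this returns None for the 1251 files, the parser has no declaration to go
on, so adding one to each file is the first thing to try.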

fwiw, libswish3 checks the LANG and LC_CTYPE env vars and falls back on that if
the encoding is not declared in the document. libxml2 doesn't do that for you,
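
A minimal sketch of that kind of environment fallback, in Python. This is not
libswish3's code: the function name encoding_from_locale, the lookup order
(LC_CTYPE before LANG) and returning None when nothing is found are all
assumptions made for illustration.

```python
import os

def encoding_from_locale(environ=None):
    """Guess an encoding from LC_CTYPE or LANG, or return None."""
    env = os.environ if environ is None else environ
    for var in ("LC_CTYPE", "LANG"):  # assumed order, see note above
        value = env.get(var, "")
        # Locale names look like language_TERRITORY.codeset@modifier
        _, dot, codeset = value.partition(".")
        codeset = codeset.partition("@")[0]
        if dot and codeset:
            return codeset
    return None  # no codeset in either variable, e.g. LANG=C

if __name__ == "__main__":
    print(encoding_from_locale({"LANG": "ru_RU.CP1251"}))
```

A caller would use the result only when the document itself declares no
encoding.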

