Re: Scandinavian characters

From: Bill Moseley <moseley(at)>
Date: Wed Nov 07 2001 - 16:42:26 GMT
At 05:40 AM 11/07/01 -0800, Mikael Niku wrote:
>I successfully set up SWISH-E on my www site. Everything
>is working well - except that the example CGI script
>provided at
>doesn't like Scandinavian characters. Command-line
>SWISH-E search has no problems with that.
>I never had time to learn Perl, and couldn't fix this
>by tinkering with the script code. I'd be very grateful if
>someone could help me out!

I don't think that script's author had time to learn Perl, either ;).

>I guess the relevant part of the code is the following:
>$keystring = $FORM{'keywords'};
>if ($keystring =~ /^([\w\-\. ]+)$/ ) {
>    $topic = $1;

I suppose you have it figured out now, but you need to add the characters
you want to allow to the character class between [ and ].
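
As a rough sketch of what that could look like, assuming the form data
arrives as ISO-8859-1 (Latin-1) bytes: the character class below keeps the
original characters and adds the Latin-1 letters 0xC0-0xFF. The helper name
validate_keywords is made up for illustration and is not part of the
original script.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): returns the query
# if every character is allowed, otherwise returns undef.
sub validate_keywords {
    my ($keystring) = @_;
    return unless defined $keystring;

    # Original class [\w\-\. ] plus the ISO-8859-1 letters 0xC0-0xFF,
    # skipping the multiplication sign (0xD7) and division sign (0xF7).
    # Assumes the input is Latin-1 bytes, not UTF-8.
    if ($keystring =~ /^([\w\-. \xC0-\xD6\xD8-\xF6\xF8-\xFF]+)$/) {
        return $1;
    }
    return;
}
```

With this, a query such as "sm\xF6rg\xE5sbord" passes, while one containing
shell metacharacters such as ";" is still rejected.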

The author is doing the above for security, because the script runs swish
in an insecure way: the query ends up somewhere the shell can interpret it.
If you fork/exec swish directly, no shell is involved, so you don't have to
worry about filtering chars that might have special meaning to the shell.
Any CGI security FAQ or book will discuss this, as does perldoc perlsec
(and perldoc perlipc).
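
Here is a minimal sketch of the fork/exec approach, along the lines of the
"safe pipe open" idiom in perldoc perlipc. The function name
run_without_shell and the paths in the usage comment are assumptions for
illustration, not taken from any existing script.

```perl
use strict;
use warnings;

# Run a command without involving a shell and return its output lines.
sub run_without_shell {
    my (@cmd) = @_;

    # open with '-|' forks implicitly: the parent reads the child's
    # STDOUT through $fh; in the child, open returns 0.
    my $pid = open(my $fh, '-|');
    die "Cannot fork: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: the list form of exec hands each argument straight to
        # the program, so shell metacharacters are never interpreted.
        exec { $cmd[0] } @cmd or die "Cannot exec $cmd[0]: $!";
    }

    my @lines = <$fh>;
    close $fh;
    return @lines;
}

# Hypothetical usage (binary and index paths are assumptions):
# my @results = run_without_shell('/usr/local/bin/swish-e',
#                                 '-w', $query, '-f', '/path/to/index.swish-e');
```

Because the query is passed as a single argument, something like
"a; echo b" reaches the program as literal text rather than as two shell
commands.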

You should check swish's configuration and make sure that the characters
are also in swish-e's WordCharacters, BeginCharacters and EndCharacters.
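
For illustration only, a config fragment along these lines might do it. The
exact character list is an assumption (adjust it to the letters you need),
and the config file would have to be saved in the same 8-bit encoding as
the documents being indexed, assumed here to be ISO-8859-1.

```
# swish.conf fragment (illustrative; character list is an assumption)
WordCharacters  abcdefghijklmnopqrstuvwxyz0123456789åäöæø
BeginCharacters abcdefghijklmnopqrstuvwxyz0123456789åäöæø
EndCharacters   abcdefghijklmnopqrstuvwxyz0123456789åäöæø
```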

Also, swish converts everything to lowercase, and that's based on the
current locale setting.

There's also a TranslateCharacters setting which can be used to map 8-bit

My advice is to make sure you are using a development snapshot (see the site), and index small sample files like

      ./swish-e -c swish.conf -i test.txt -T indexed_words

And you will see how swish is parsing words, and converting to lowercase.

Then you might look into a better CGI script.  The script that's used to
search this discussion list archive is included in the swish-e
(development) distribution.

Post again if you need more help.  And please post any issues you had to
solve so they can be documented.


Bill Moseley
Received on Wed Nov 7 16:42:52 2001