Re: words, entities and accents

From: Jose Manuel Ruiz <jmruiz(at)>
Date: Fri Jun 09 2000 - 06:50:18 GMT
Hi, Philip wrote:
> Swish-e 1.3.2 doesn't index documents containing HTML entities properly.
> Because WORDCHARS doesn't contain ';', a word like "montr&eacute;al" is
> indexed as two words "montré" and "al".  I cured this temporarily by adding
> ';' to WORDCHARS.

You are right, but you can also add ';' to the WordCharacters directive in
your config file.
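
To illustrate why the ';' matters, here is a minimal sketch of a splitter that
keeps only runs of characters found in a word-character set. It is not the
swish-e code: split_words(), MAX_WORDS, MAX_WORD_LEN and the character sets
used in the usage note are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_WORDS 8
#define MAX_WORD_LEN 64

/* Hypothetical helper: split `text` into runs of characters that appear
   in `wordchars`. Returns the number of words stored in `words`. */
int split_words(const char *text, const char *wordchars,
                char words[MAX_WORDS][MAX_WORD_LEN])
{
    int count = 0;
    size_t len = 0;

    assert(text != NULL && wordchars != NULL);
    for (const char *p = text; ; p++) {
        /* The '\0' check comes first because strchr() also matches
           the terminator of `wordchars`. */
        int is_word_char = (*p != '\0' && strchr(wordchars, *p) != NULL);

        if (is_word_char) {
            if (count < MAX_WORDS && len < MAX_WORD_LEN - 1)
                words[count][len++] = *p;
        } else {
            if (len > 0 && count < MAX_WORDS) {
                words[count][len] = '\0';   /* close the current word */
                count++;
            }
            len = 0;
            if (*p == '\0')
                break;
        }
    }
    return count;
}
```

With the set "abcdefghijklmnopqrstuvwxyz&" the input "montr&eacute;al" yields
two words, "montr&eacute" and "al". Adding ';' to the set yields the single
word "montr&eacute;al", which can then be passed to entity conversion intact.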

> While digging in index.c to discover this, I see the following comment:
>        /* Ok, can now go to lowercase, the whole problem
>           was with entities &Aacute; would become &aacute;
>         */
> I find this strange, because it's EXACTLY what I want.  Otherwise,
> "R&Eacute;SEAU" becomes "rÉseau", and won't be found if you search for
> "réseau".  While I realise that I should be using locales so that
> tolower() does the right thing, I'd rather not go there.
> Is there ever a case where it is undesirable for an HTML entity to be
> converted to lower case as-is?  Seeing as how we are going to convert it to
> lower case after converting to an ISO-latin-1 char anyway.
> Also, it seems to me a better idea to convert all entities *before*
> looking for word boundaries.  This means that "&;" can be removed from
> WORDCHARS.  Is there any particular reason this isn't done now?

Entities are converted in the convertentities function (which also calls
converttonamed and converttoascii).
convertentities is executed before lowercasing and before stripping the
first and last characters, so montr&eacute; becomes montré prior to being
lowercased.
In my language (Spanish), I also need montré to become montre to avoid
search errors when people misspell words. This is why I have added the
TranslateCharacters directive to the code (see my previous message
in the
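
The order described above (decode the entity, then lowercase, then optionally
fold the accent) can be sketched as follows. This is an illustration only, not
the swish-e implementation: normalize_word(), latin1_tolower(), strip_accent()
and the four-entry entity table are names and data invented for the example,
and the lowercasing is done by hand for ISO-8859-1 instead of through
tolower() and a locale.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tiny illustrative entity table (ISO-8859-1 code points). */
struct entity { const char *name; unsigned char ch; };
static const struct entity ENTITIES[] = {
    { "&eacute;", 0xE9 }, { "&Eacute;", 0xC9 },
    { "&aacute;", 0xE1 }, { "&Aacute;", 0xC1 },
};

/* Lowercase one ISO-8859-1 byte without relying on the C locale. */
unsigned char latin1_tolower(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        return (unsigned char)(c + 32);
    /* 0xC0..0xDE are uppercase accented letters, except 0xD7. */
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
        return (unsigned char)(c + 32);
    return c;
}

/* Map a lowercase accented vowel to its unaccented form. */
unsigned char strip_accent(unsigned char c)
{
    if (c >= 0xE0 && c <= 0xE5) return 'a';
    if (c >= 0xE8 && c <= 0xEB) return 'e';
    if (c >= 0xEC && c <= 0xEF) return 'i';
    if (c >= 0xF2 && c <= 0xF6) return 'o';
    if (c >= 0xF9 && c <= 0xFC) return 'u';
    return c;
}

/* Decode entities, lowercase, and (if `translate` is non-zero) fold accents. */
void normalize_word(const char *in, char *out, size_t out_size, int translate)
{
    size_t o = 0;

    assert(in != NULL && out != NULL && out_size > 0);
    while (*in != '\0' && o + 1 < out_size) {
        unsigned char c = (unsigned char)*in;
        size_t consumed = 1;

        /* Step 1: replace a known entity by its ISO-8859-1 byte. */
        for (size_t i = 0; i < sizeof ENTITIES / sizeof ENTITIES[0]; i++) {
            size_t n = strlen(ENTITIES[i].name);
            if (strncmp(in, ENTITIES[i].name, n) == 0) {
                c = ENTITIES[i].ch;
                consumed = n;
                break;
            }
        }
        c = latin1_tolower(c);       /* Step 2: lowercase after decoding. */
        if (translate)
            c = strip_accent(c);     /* Step 3: optional accent folding.  */
        out[o++] = (char)c;
        in += consumed;
    }
    out[o] = '\0';
}
```

Because the entity is decoded before lowercasing, "R&Eacute;SEAU" comes out as
"réseau" (in ISO-8859-1), and with translation enabled "montr&eacute;" comes
out as "montre".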

> I note that the code to split words up is duplicated in 4 functions
> (countwords(), countwordstr(), parsecomment() and
> parseMetaData()).  This makes things like changing the entities handling a
> tad error-prone, to say the least.  Wouldn't it be better for each
> function to look for strings that are to be converted to words, then call
> addstring() (say), which does word splitting, entity handling and calls
> addentry()?  I'll write the code, but would like opinions first.  And, if
> anyone has a torture test or coverage test to make sure I don't break
> something, I'll be needing that....
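
A rough sketch of what such a single entry point could look like is given
below. Everything in it is an assumption made for illustration: the signature
of add_string(), the add_entry_fn callback type standing in for addentry(),
the collect_word() callback, and the fact that only &eacute; and &Eacute; are
decoded. It decodes entities before testing for word boundaries, so neither
'&' nor ';' has to count as a word character. Lowercasing is left out to keep
the example short.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type standing in for addentry(). */
typedef void (*add_entry_fn)(const char *word, void *ctx);

/* Example callback: collect the words into a fixed-size list. */
struct word_list { char words[8][64]; int count; };

void collect_word(const char *word, void *ctx)
{
    struct word_list *list = ctx;
    if (list->count < 8) {
        strncpy(list->words[list->count], word, 63);
        list->words[list->count][63] = '\0';
        list->count++;
    }
}

/* ASCII letters and digits plus ISO-8859-1 accented letters. */
int is_word_byte(unsigned char c)
{
    return isalnum(c) || (c >= 0xC0 && c != 0xD7 && c != 0xF7);
}

/* Split `text` into words, decoding entities first, and hand each word
   to `add_entry`. Returns the number of words passed on. */
int add_string(const char *text, add_entry_fn add_entry, void *ctx)
{
    char word[64];
    size_t len = 0;
    int count = 0;

    assert(text != NULL && add_entry != NULL);
    for (;;) {
        unsigned char c = (unsigned char)*text;
        size_t consumed = (c != '\0') ? 1 : 0;

        /* Entity handling happens before the word-boundary test. */
        if (strncmp(text, "&eacute;", 8) == 0) { c = 0xE9; consumed = 8; }
        else if (strncmp(text, "&Eacute;", 8) == 0) { c = 0xC9; consumed = 8; }

        if (c != '\0' && is_word_byte(c)) {
            if (len < sizeof word - 1)
                word[len++] = (char)c;
        } else {
            if (len > 0) {              /* a word just ended */
                word[len] = '\0';
                add_entry(word, ctx);
                count++;
                len = 0;
            }
            if (c == '\0')
                break;
        }
        text += consumed;
    }
    return count;
}
```

Each of the four callers would then only locate the text to index and pass it
to this one function, so a change to entity handling is made in one place.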

You are right. swish-e-1.3.2 has several shortcomings in its code. Take a
look at the memory handling: you will see many more malloc, realloc and
strdup calls than free calls. If we talk about buffer overruns and
performance, swish-e has severe shortcomings in its
Fortunately, many people have worked on it. I have tried to add all these
patches and many new features, including better performance, to the package.
Take a look

Have a nice day 

Jose Ruiz
Received on Fri Jun 9 02:56:31 2000