RE: Making Swish search like it indexes

From: Mark Gaulin <gaulin(at)not-real.globalspec.com>
Date: Thu Oct 28 1999 - 22:30:43 GMT
The word stemming option had the same problem so I found a way to add a
simple on/off flag into the index header (as a comment) and to parse it out
during searches. Something like that ought to work for the other indexing
options.
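
A rough sketch of that idea in C, for anyone who wants to try it with the other options. The "# Stemming: " line format and both function names are made up for illustration; they are not what Swish actually writes or calls.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical header comment format (an assumption, not Swish's real one). */
#define STEM_TAG "# Stemming: "

/* Format the on/off flag as a header comment line into buf.
   Returns the number of characters written, as snprintf does. */
int format_stem_flag(char *buf, size_t size, int enabled)
{
    return snprintf(buf, size, "%s%d\n", STEM_TAG, enabled ? 1 : 0);
}

/* Inspect one header line at search time.
   Returns 1 or 0 for the flag, or -1 if the line is some other comment. */
int parse_stem_flag(const char *line)
{
    size_t n = strlen(STEM_TAG);

    if (strncmp(line, STEM_TAG, n) != 0)
        return -1;
    return line[n] == '1' ? 1 : 0;
}
```

The indexer would write the formatted line along with the rest of the header, and the search side would run each header line through the parser until it gets something other than -1.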

-----Original Message-----
From: swish-e@sunsite.berkeley.edu [mailto:swish-e@sunsite.berkeley.edu] On
Behalf Of Bill Moseley
Sent: Thursday, October 28, 1999 5:30 PM
To: Multiple recipients of list
Subject: [SWISH-E] Making Swish search like it indexes

Swish uses WordCharacters, BeginChars, EndChars, IgnoreFirstChar, and
IgnoreLastChar to decide what defines a word, and thus what is indexed.

For example, if you don't have a dash in WordCharacters and you have a
string in a document as: 'dashed-words', then Swish will index two separate
words 'dashed' and 'words' (or 'dash' and 'word' if using stemming).

This means you can search for either word and find the document.
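
To make the splitting concrete, here is a minimal C sketch of breaking text on anything outside a word-character list. It is not Swish's indexing code; split_words() and MAX_WORD are names invented for this example, and the word-character list is passed in as a plain string.

```c
#include <string.h>

#define MAX_WORD 64

/* Split text on every character that is not in wordchars.
   Stores up to max words in out (long words are truncated to
   MAX_WORD - 1 characters) and returns how many were stored. */
int split_words(const char *text, const char *wordchars,
                char out[][MAX_WORD], int max)
{
    int count = 0;
    size_t len = 0;

    for (;; text++) {
        /* strchr() also matches the terminating NUL, so rule it out first. */
        if (*text != '\0' && strchr(wordchars, *text) != NULL) {
            if (count < max && len < MAX_WORD - 1)
                out[count][len++] = *text;
        } else {
            if (len > 0 && count < max) {
                out[count][len] = '\0';
                count++;
            }
            len = 0;
            if (*text == '\0')
                break;
        }
    }
    return count;
}
```

With only lowercase letters in the list, 'dashed-words' comes out as two words; add the dash to the list and it comes out as one.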

The problem with Swish is that it doesn't use the same rules for searching
as it does for indexing.  So, searching for 'dashed-words' fails because
during the search Swish doesn't break the search into the two words.  It's
looking for 'dashed-words' and that's not in the index.

(Of course, you could add the dash to the WordCharacters list, but then you
couldn't search for the separate words -- but I'm not debating that here.)

One solution is to do pre-processing of the search terms before calling
Swish.  This requires reading the swish conf file used for indexing and
then splitting the words up and passing those words to Swish.  Here's a way
to pre-process in Perl that replicates how Swish indexes:

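# Split on anything that is not a word character (or the * wildcard),
# strip ignorable leading/trailing characters from each piece, then keep
# only the pieces that begin and end with allowed characters.
# Assumes each variable holds a character list that is safe inside [...].
return grep {
        /^[$BeginCharacters]/o &&
        /[$EndCharacters*]$/o
    } map {
        s/^[$IgnoreFirstChar]+//o;
        s/[$IgnoreLastChar]+(?=\*?$)//o;
        $_;
    } split /[^$WordCharacters*]/o, lc $query;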

Well, almost.  That returns a list instead of a scalar, and doesn't deal
properly with ( and ).
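
For comparison with the C changes discussed next, the grep step of that filter might look roughly like this in C. word_is_acceptable() is a hypothetical helper, not a Swish function, and the begin/end lists are passed in as plain strings rather than read from the config.

```c
#include <string.h>

/* Return 1 if word starts with a character from begin_chars and ends
   with a character from end_chars or with the '*' wildcard; else 0. */
int word_is_acceptable(const char *word, const char *begin_chars,
                       const char *end_chars)
{
    size_t len = strlen(word);
    char last;

    if (len == 0)
        return 0;
    if (strchr(begin_chars, word[0]) == NULL)
        return 0;
    last = word[len - 1];
    return last == '*' || strchr(end_chars, last) != NULL;
}
```

As in the Perl, a trailing '*' is always allowed so that wildcard searches survive the filter.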

Or, to modify Swish to handle the words 'properly', search.c needs a few
changes in the search() function:

** Only SLIGHTLY tested.  **

Instead of splitting on space as Swish currently does,

    if (isspace(words[i]) || words[i] == '(' || words[i] == ')' ||
        words[i] == '=')

Split instead on non-WordCharacters as defined in your swish.conf file.

    if ((!iswordchar(words[i]) && words[i] != '*') || words[i] == '(' ||
        words[i] == ')' || words[i] == '=')
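
Pulled out of search() so it can be tried on its own, the new test might look like the sketch below. in_wordchars() is only a stand-in for Swish's iswordchar(), which gets its list from the config rather than from an argument, and count_query_terms() is a made-up helper that just counts the word terms (it skips over the operators instead of keeping them as tokens the way search() does).

```c
#include <string.h>

/* Stand-in for Swish's iswordchar(): here the list is passed explicitly. */
static int in_wordchars(char c, const char *wordchars)
{
    return c != '\0' && strchr(wordchars, c) != NULL;
}

/* Same condition as the modified if() above: break on any character
   that is not a word character or '*', and on the operators ( ) =. */
int is_query_separator(char c, const char *wordchars)
{
    return (!in_wordchars(c, wordchars) && c != '*')
        || c == '(' || c == ')' || c == '=';
}

/* Count how many word terms the query breaks into. */
int count_query_terms(const char *query, const char *wordchars)
{
    int count = 0;
    int in_term = 0;

    for (; *query != '\0'; query++) {
        if (is_query_separator(*query, wordchars)) {
            in_term = 0;
        } else if (!in_term) {
            in_term = 1;
            count++;
        }
    }
    return count;
}
```

With only lowercase letters as word characters, 'dashed-words' now counts as two terms, which matches what went into the index.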

Now, looking at the search() function is a bit scary.  search() takes the
query and splits it into separate words placed in a linked list.  It builds
the linked list of search terms one word at a time, but in three different
places, using the line:

        searchwordlist = (struct swline *) addswline(searchwordlist,
                        (char *) convertentities(word));

The first occurrence of this line is for adding Metanames to the search
(such as when using the metaname=(word) type of search).  I don't know
what's the point of using convertentities() here, but that's another story.

The other two places are used for adding search words to the query.  These
are the locations where we can modify the search words one-by-one.  Add
the two strip calls right before each, so that it reads:

    stripIgnoreLastChars(word);
    stripIgnoreFirstChars(word);
    searchwordlist = (struct swline *) addswline(searchwordlist,
                                (char *) convertentities(word));
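
If you want to see what the two strip calls are expected to do, here is a rough stand-alone sketch. These are not the Swish functions: strip_first_chars() and strip_last_chars() are hypothetical names, they take the ignore list as an argument instead of reading the config, and keeping a trailing '*' is my assumption, borrowed from the Perl version above.

```c
#include <string.h>

/* Remove, in place, leading characters that appear in ignore_first. */
void strip_first_chars(char *word, const char *ignore_first)
{
    size_t skip = strspn(word, ignore_first);

    /* +1 copies the terminating NUL as well. */
    memmove(word, word + skip, strlen(word + skip) + 1);
}

/* Remove, in place, trailing characters that appear in ignore_last,
   keeping a final '*' so wildcard searches still work. */
void strip_last_chars(char *word, const char *ignore_last)
{
    size_t len = strlen(word);
    int wildcard = (len > 0 && word[len - 1] == '*');

    if (wildcard)
        len--;
    while (len > 0 && strchr(ignore_last, word[len - 1]) != NULL)
        len--;
    if (wildcard)
        word[len++] = '*';
    word[len] = '\0';
}
```

Both work on the buffer directly, the same way the calls above are applied to word before it is added to the list.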

Then to make this all work properly, you MUST specify your swish config
file during searches with the -C option.  Otherwise Swish will use its
defaults for WordCharacters, IgnoreFirstChar, and IgnoreLastChar.  It would
be nice if those settings were stored in the index during indexing.



Bill Moseley
mailto:moseley@hank.org
Received on Thu Oct 28 15:28:48 1999