Skip to main content.
home | support | download

Back to List Archive

Re: Wildcard and Boolean search with swish

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Jan 15 2004 - 16:30:52 GMT
On Thu, Jan 15, 2004 at 04:25:36PM +0100, Timo Haberkern wrote:

> >>i read in the documentation that a wildcard (*) is only allowed at the 
> >>end of a word. Is that right? Is there no workaround for that?
> >
> Hmm any plans to change this? I think i'm not the only one who would 
> like a "normal" wildcard character search...

If you mean regular expressions then that's what egrep is for.  But it's
slow for a large number of words/documents.  Swish-e can be fast at
searching because the words are organized in a way that makes them fast
to search. Searching by patters requires testing the pattern on each
word. 

There's probably some optimizations that could be done to make it
faster, but nothing like how swish-e works with its reverse index.

Anyway, this is a common question -- should be in the FAQ (if it isn't 
already).  And it was just asked a week or so ago.

See, normal searching works like this: Swish takes a word -- converts it
into a hash value which is an index into an array that contains other
words of that same hash value.  When searching swish uses the same hash
function to convert each search word into a hash value.  It then uses
that as an index into the array and then walks the array looking for
that word.  Hopefully, there are not too many entries in the array for
that hash index value and it's quick to find the matching word.  Note
that the hash value does not relate to alphabetical order.

Wildcards are kind of a hack.  (Jose, let me know if I get any of this
wrong.) To do a wildcard search swish-e has an array 256 slots long --
indexed by the first letter of the word.  At each array slot is a linked 
list of words, sorted alphabetically.  Swish-e thus needs the first 
character of the wildcard word to begin searching.  Wildcard searching 
can be very slow, as you might imagine.  You can also see how this 
system won't work if swish-e is to ever support unicode characters.

The other day I mentioned that you could likely do this inside of swish:

   foo*bar

where swish-e would search for foo* as it currently does, but filter 
results where the word ends in bar.  Still, it would be much slower than 
normal full-word searches.

Can you give some examples of wild card searches you would like to do?  
I.e. searches where you know the start and end of a word but not the 
middle?


-- 
Bill Moseley
moseley@hank.org
Received on Thu Jan 15 16:31:08 2004