Re: Exploring Swish possibilities

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Oct 01 2002 - 13:02:08 GMT
At 05:09 AM 10/01/02 -0700, Kristaps Erglis wrote:
>Thanks Bill!
>
>> What are you trying to accomplish?  Might be easier to give you an answer
>> if you describe the problem you are trying to solve.
>
>I already have my own result highlighting system written in PHP and my only
>thoughts was about how much computing power I could save on this task if
>highlights come out from SWISH runtime not PHP script.

I agree.  I'd like to see such an ability.

The highlighting code I have in Perl is really slow.  It's not Perl that is
slow, it just has to do a lot of work to be accurate.  If you don't want to
be so accurate (don't highlight phrases, allow false positives, don't worry
about stopwords) then it can be much faster with simple regex substitution.
That's why I included more than one Perl highlighting module in the
distribution.
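
To make the fast-but-sloppy option concrete, here is a minimal sketch of plain regex substitution.  It is written in Python only for illustration and is not the code from any module in the distribution; quick_highlight and its parameters are made-up names.  It ignores phrases and stopwords, so it will happily produce the false positives mentioned above.

```python
import re

def quick_highlight(text, words, open_tag="<b>", close_tag="</b>"):
    """Wrap every whole-word, case-insensitive match of a query word in tags.

    Hypothetical helper: no phrase handling, no stopword handling.
    """
    if not words:
        return text
    # Escape each word so regex metacharacters in the query are taken literally.
    # Longer words go first so they are tried before shorter alternatives.
    alternatives = "|".join(
        re.escape(w) for w in sorted(words, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: open_tag + m.group(1) + close_tag, text)
```

For a phrase query such as "organic farming" this would be called with the two words separately, so "farming" gets marked wherever it appears, whether or not "organic" precedes it.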

In my code I use the WordCharacters, IgnoreFirstChar, IgnoreLastChar,
Buzzwords, Stopwords, and Parsed Words headers from the search to parse the
source document just like swish does when indexing, except I parse it into
an array of 

 <swish word>
 <non-swish-word text or stopword>
 <swish word>
 <non-swish-word text or stopword>....
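
A rough sketch of that split, in Python for illustration only.  It assumes a word is simply a maximal run of WordCharacters and that stopwords are compared in lowercase; it leaves out IgnoreFirstChar, IgnoreLastChar and Buzzwords entirely, so it is not what swish or my Perl code does in full.  split_like_indexer is a made-up name.

```python
import re

def split_like_indexer(text, word_chars, stopwords=frozenset()):
    """Return [(is_word, text), ...] with word and filler entries alternating.

    Assumption: a word is a maximal run of characters from word_chars.
    Stopwords are folded into the surrounding filler text.
    """
    word_re = re.compile("[" + re.escape(word_chars) + "]+")
    segments = []

    def add(is_word, chunk):
        if not chunk:
            return
        if not is_word and segments and not segments[-1][0]:
            # Merge with the previous filler entry to keep strict alternation.
            segments[-1] = (False, segments[-1][1] + chunk)
        else:
            segments.append((is_word, chunk))

    pos = 0
    for match in word_re.finditer(text):
        add(False, text[pos:match.start()])      # text between words
        token = match.group(0)
        add(token.lower() not in stopwords, token)
        pos = match.end()
    add(False, text[pos:])                        # trailing text
    return segments
```

Joining the text of every entry gives back the original document unchanged, which is what lets the later steps print excerpts with the original punctuation and spacing.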

Then I arrange the Parsed Words into meta names, and then into separate
phrases.  I sort the phrases into reverse length order so I highlight the
longest phrases first.  Then I walk the array looking for phrases (which
can be a single word) and mark the array for sections that need to be
highlighted.
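
The marking pass could be sketched like this (Python, illustration only).  It assumes the array is a list of (is_word, text) pairs as described above, that each phrase is already a list of words for one metaname, and that a phrase matches when its words appear as consecutive word entries.  mark_phrases is a hypothetical name, and grouping by metaname is left out.

```python
def mark_phrases(segments, phrases):
    """Return one label per segment: the matched phrase, or None.

    segments: [(is_word, text), ...] alternating words and filler.
    phrases:  list of word lists; a single word is a one-element list.
    """
    marks = [None] * len(segments)
    # Positions and lowercase text of the word entries, in document order.
    word_idx = [i for i, (is_word, _) in enumerate(segments) if is_word]
    words = [segments[i][1].lower() for i in word_idx]
    # Longest phrases first, so they claim their segments before shorter ones.
    ordered = sorted(
        ([w.lower() for w in p] for p in phrases if p), key=len, reverse=True
    )
    for phrase in ordered:
        n = len(phrase)
        label = " ".join(phrase)
        for start in range(len(words) - n + 1):
            if words[start:start + n] != phrase:
                continue
            # Mark the words and the filler between them; never overwrite
            # a segment already claimed by a longer phrase.
            for i in range(word_idx[start], word_idx[start + n - 1] + 1):
                if marks[i] is None:
                    marks[i] = label
    return marks
```

Keeping a label rather than a plain flag is just one design choice; it makes it possible to see which phrase claimed a given stretch of the document.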

Then the array is walked again looking for windows to display (a few words
on either side of the highlighted phrase).  Then add "..." and print it out.
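
The second walk might look roughly like this (Python, illustration only).  It assumes the same (is_word, text) array plus a parallel list of marks where any true value means "highlight".  excerpt, context_words and the tag defaults are hypothetical names and values, not settings from the real modules.

```python
def excerpt(segments, marks, context_words=3, open_tag="<b>", close_tag="</b>"):
    """Build a "... context <b>hit</b> context ..." string from marked segments."""
    keep = [False] * len(segments)
    for i, mark in enumerate(marks):
        if not mark:
            continue
        keep[i] = True
        # Extend the window left and right until enough words are included.
        for step in (-1, 1):
            j, seen = i + step, 0
            while 0 <= j < len(segments) and seen < context_words:
                keep[j] = True
                if segments[j][0]:
                    seen += 1
                j += step

    pieces = []
    in_gap = False          # currently skipping text outside any window
    highlighting = False    # currently inside an open highlight tag
    for i, (_, text) in enumerate(segments):
        if not keep[i]:
            if highlighting:
                pieces.append(close_tag)
                highlighting = False
            in_gap = True
            continue
        if in_gap:
            pieces.append(" ... " if pieces else "... ")
            in_gap = False
        if marks[i] and not highlighting:
            pieces.append(open_tag)
            highlighting = True
        elif not marks[i] and highlighting:
            pieces.append(close_tag)
            highlighting = False
        pieces.append(text)
    if highlighting:
        pieces.append(close_tag)
    if in_gap and pieces:
        pieces.append(" ...")
    return "".join(pieces)
```

Windows that overlap merge on their own because they share entries in the keep list, so two hits close together come out as one excerpt rather than two.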

It's more accurate than what google does, but doesn't quite have the
demands that google has, either! ;)

I could optimize by pre-parsing the source docs.  But it's easier to use a
faster CPU.

If you search for a single word swish could possibly tell you the word
positions for each hit.  But it becomes more complicated with complex
queries that might contain multiple metanames.  For one thing, position
data is not stored for every word hit -- just the last one so that phrases
can be matched.  

And you can imagine how much data swish would have to return for a query like:

    desc=(apples or oranges not grapes) \
       or title=(food or health) \
       and category=(business or farming or "organic farming")

-- 
Bill Moseley
mailto:moseley@hank.org
Received on Tue Oct 1 13:05:52 2002