On Thu, Oct 05, 2006 at 07:19:47AM -0700, Peter Karman wrote:
> not strictly Swish-related, but wondering how others of you implement the 'did
> you mean...' feature in their web search apps. Do you use a custom thesaurus?
> Dictionary? etc.
I've done a dictionary lookup before using Text::Aspell. I created a
dictionary for each meta name so only words in the index would be
returned in spelling suggestions.
Hum, I've got this module floating around -- maybe it's old, as I
thought I had a version that used SWISH::API to determine "swish
words".
I've got other code for doing spelling and re-displaying, but it's
very ugly and would take me a while to read it off the punch card
backups.
SYNOPSIS
use LII::SpellCheck;
# caches open dictionary handle
my $speller = LII::SpellCheck->new(
    dictionary     => $dict_path,
    stopwords      => \@word_list,
    wordcharacters => $valid_word_characters,
    max_words      => $max_words_to_return,
);
# later
my $words = $speller->check( $query );
$words is an array of hashes
DESCRIPTION
This module takes a string of text and looks up words in the Aspell
dictionary pointed to by $dict_path. The words are split into "swish"
words based upon the stopwords and wordcharacters passed in.
Wordcharacters are the valid characters that can be in a word indexed
by swish.
Keep in mind that a dictionary is flat, where a swish index is really
many indexes. This has to be considered when creating the GNU Aspell
dictionary.
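As a rough illustration of the splitting described above, here is a minimal sketch, not the module's actual code. The word characters, stopwords and the "spellcheck" key are made-up stand-ins for this example; in practice the first two would come from the index headers.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for values normally read from the swish index.
my $wordcharacters = 'abcdefghijklmnopqrstuvwxyz0123456789';
my %stopword = map { $_ => 1 } qw(the of a);
my %operator = map { $_ => 1 } qw(and or not);

# Split a query into alternating runs of word and non-word characters.
# Words that are neither stopwords nor operators are flagged with the
# (hypothetical) "spellcheck" key.
sub tokenize_query {
    my ($query) = @_;
    my $class = quotemeta $wordcharacters;
    my @tokens;
    while ( $query =~ /([${class}]+)|([^${class}]+)/gi ) {
        if ( defined $1 ) {
            my $word = $1;
            my $lc   = lc $word;
            push @tokens, {
                word       => $word,
                isword     => 1,
                spellcheck => ( $stopword{$lc} || $operator{$lc} ) ? 0 : 1,
            };
        }
        else {
            push @tokens, { word => $2 };
        }
    }
    return \@tokens;
}

my $tokens   = tokenize_query('the serach AND (index)');
my @to_check = map { $_->{word} } grep { $_->{spellcheck} } @$tokens;
print "@to_check\n";    # serach index
```

Keeping the non-word runs as their own tokens means the original query can be rebuilt by joining every "word" value back together.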
METHODS
new( \%config )
The new() method returns a new object that caches an open dictionary.
The method will die on errors. This should be trapped by the caller.
Parameters are passed as a hash (or ref to a hash). All are required
except where noted.
Parameters are:
dictionary
This is the full path to the GNU Aspell dictionary file.
dictionary => '/path/to/dictionary',
stopwords
This is an array reference of stopwords -- words to ignore while
spelling.
stopwords => [ $swish->HeaderValue( $index, 'stopwords' ) ],
wordcharacters
This is a string of the valid characters, in the 8859-1 encoding, that
can make up a word.
wordcharacters => $swish->HeaderValue( $index, 'wordcharacters' ),
See CAVEATS below about limitations in how words are split.
max_words
Sets the maximum number of word suggestions to return for each
incorrect word. The default is four.
This option is not required.
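Since new() dies on errors, the caller might wrap it in an eval. This is only a sketch: the dictionary path, stopwords and word characters below are placeholder values, and the require is inside the eval so that a missing module is trapped the same way as a bad dictionary.

```perl
use strict;
use warnings;

my $speller = eval {
    require LII::SpellCheck;    # dies here if the module is not installed
    LII::SpellCheck->new(
        dictionary     => '/path/to/dictionary',    # placeholder path
        stopwords      => [qw(the of a)],           # placeholder list
        wordcharacters => 'abcdefghijklmnopqrstuvwxyz0123456789',
        max_words      => 4,
    );
};

if ( !$speller ) {
    # Carry on without spelling suggestions rather than failing the search.
    warn "Spell checking disabled: $@";
}
```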
check
This method checks a string for words not found in the dictionary. The
string is split into words and non-words. Words that are not stopwords
or swish operators (and, or, not) will be checked.
Returns an array of hashes. The array is the string passed in,
tokenized into "swish_words" and non-swish_words. Each element has one
or more of the following keys:
word
This key is the original text from the string passed into check.
It may contain text or blank.
isword
This is true if the word is considered a "swish_word" (i.e. is made
up of wordcharacter characters). This will include stopwords and
swish-e operators.
unknown
This is true for words that could be in the swish-e index, but
could not be spell checked because they contain non-alpha
characters.
suggestions
This is an array reference of word suggestions. If the array is
empty, the word was not found in the dictionary, but the dictionary
offered no suggestions.
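To show how the returned structure could drive a "did you mean" line, here is a small sketch. The sample data is invented for illustration and is not real output from the module, and it assumes the suggestions are ordered best first.

```perl
use strict;
use warnings;

# Hypothetical result of check() for the query "the serach AND (indx)".
my $words = [
    { word => 'the',    isword => 1 },
    { word => ' ' },
    { word => 'serach', isword => 1, suggestions => [ 'search', 'seraph' ] },
    { word => ' ' },
    { word => 'AND',    isword => 1 },
    { word => ' (' },
    { word => 'indx',   isword => 1, suggestions => [] },
    { word => ')' },
];

# Rebuild the query, swapping in the first suggestion where one exists.
# Returns undef when nothing was changed, so the caller can skip the prompt.
sub did_you_mean {
    my ($tokens) = @_;
    my $changed = 0;
    my @parts;
    for my $token (@$tokens) {
        my $suggestions = $token->{suggestions};
        if ( $suggestions && @$suggestions ) {
            push @parts, $suggestions->[0];
            $changed = 1;
        }
        else {
            push @parts, $token->{word};
        }
    }
    return $changed ? join( '', @parts ) : undef;
}

my $suggested = did_you_mean($words);
print "Did you mean: $suggested\n" if defined $suggested;
```

A word with an empty suggestions list ("indx" here) is left as typed, since there is nothing better to offer.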
CAVEATS
The string of words (i.e. query) passed to the check() method has to be
converted into "swish words" before the dictionary is searched. This
means throwing out stopwords and splitting words based on how Swish-e
splits words while indexing.
Swish-e does provide a "Parsed Words" header that has the input query
converted into "swish words", but it cannot be used when searching a
stemmed index (since the parsed query is stemmed). It also means that
some words would not show up when re-displaying the query with
corrected spelling to the user.
So, this module must try to emulate how swish would parse words, which
is why stopwords and wordcharacters are passed in. Unfortunately, swish
uses more than just those two items to generate "swish words", meaning
the conversion will not always match how swish parses.
There are two options, though. One would be to use the Parsed Words
output from Swish, but that means running a second query on a
non-stemmed index. The other would be to expose in the swish API a
method to access "swish words".
AUTHOR
Bill Moseley <moseley@hank.org>
COPYRIGHT
This module is Copyright (c) 2005 Bill Moseley.
You may distribute under the terms of either the GNU General Public
License or the Artistic License, as specified in the Perl README file.
--
Bill Moseley
moseley@hank.org
Received on Thu Oct 5 08:39:05 2006