On Fri, Aug 08, 2003 at 03:18:19AM -0700, Sean Downey wrote:
> Hello
>
> Is it possible to search for words with an apostrophe??
> e.g. o'reilly
>
> I did a search for -O'Reilly- but it said No Results.
> When I search for Reilly it returned 66 results one of which was
> "The vendor is Gerry O'Reilly, who owns the ........."
Are you sure about that?
moseley@bumby:~$ cat 1.txt
The vendor is Gerry O'Reilly, who owns the
moseley@bumby:~$ swish-e -i 1.txt -v0
moseley@bumby:~$ swish-e -w "o'reilly" -H0
1000 1.txt "1.txt" 43
moseley@bumby:~$ swish-e -w "o reilly" -H0
1000 1.txt "1.txt" 43
moseley@bumby:~$ swish-e -w "reilly" -H0
1000 1.txt "1.txt" 43
Since the apostrophe is not part of the "WordCharacters" setting, swish-e
indexes O'Reilly as *two* words. So this also matches:
moseley@bumby:~$ swish-e -w "reilly o" -H0
1000 1.txt "1.txt" 43
In every case above except the plain "reilly" search, you are really
searching for "reilly AND o":
moseley@bumby:~$ swish-e -w "reilly o" -H9 | grep -i parsed
# Parsed Words: reilly o
moseley@bumby:~$ swish-e -w "reilly'o" -H9 | grep -i parsed
# Parsed Words: reilly o
You can add the apostrophe to WordCharacters, but then you have to worry
about things like "Sean's" not being found by a search for "Sean".
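If you want to make sure you find just the "O'Reilly" entries (not
documents that merely contain Reilly and the letter o somewhere) then you
can use a phrase search to force swish to find the word "o" directly
before the word "reilly". The double quotes have to reach swish-e itself,
so wrap them in single quotes for the shell: swish-e -w '"o reilly"'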
It would be nice to strip off those 's (as in Sean's), but sometimes you
might want to search only for that form.
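To illustrate the difference between the default AND search and the phrase search mentioned above, here is a rough Python model. It is only a sketch and not swish-e's code: the helper names and the second sample document are made up, and it assumes that every non-alphanumeric character separates words.

```python
def words(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    return "".join(ch if ch.isalnum() else " " for ch in text.lower()).split()

def matches_and(doc, query):
    """True if every query word appears somewhere in the document."""
    doc_words = set(words(doc))
    return all(w in doc_words for w in words(query))

def matches_phrase(doc, phrase):
    """True if the query words appear next to each other, in order."""
    doc_words, wanted = words(doc), words(phrase)
    n = len(wanted)
    return any(doc_words[i:i + n] == wanted
               for i in range(len(doc_words) - n + 1))

doc1 = "The vendor is Gerry O'Reilly, who owns the"
doc2 = "Mr Reilly scored an O in the exam"   # hypothetical second document

print(matches_and(doc1, "o reilly"), matches_phrase(doc1, "o reilly"))
print(matches_and(doc2, "o reilly"), matches_phrase(doc2, "o reilly"))
```

Both documents satisfy the AND search, but only the first one satisfies the phrase search, which is the reason to use a phrase when you only want the O'Reilly entries.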
Another option, instead of adding the apostrophe to WordCharacters, would
be to add O'Reilly to a BuzzWords list. But see the note that follows:
moseley@bumby:~$ cat c
BuzzWords o'reilly
IgnoreLastChar ,.
moseley@bumby:~$ swish-e -i 1.txt -c c -v0
moseley@bumby:~$ swish-e -w "O'Reilly"
# SWISH format: 2.4.0-pr1
# Search words: O'Reilly
# Removed stopwords:
# Number of hits: 1
# Search time: 0.001 seconds
# Run time: 0.071 seconds
1000 1.txt "1.txt" 13
.
moseley@bumby:~$ swish-e -w "O Reilly"
# SWISH format: 2.4.0-pr1
# Search words: O Reilly
# Removed stopwords:
err: no results
.
moseley@bumby:~$ swish-e -w "Reilly"
# SWISH format: 2.4.0-pr1
# Search words: Reilly
# Removed stopwords:
err: no results
.
Now, this probably needs to be documented better (I didn't look), but
BuzzWords requires an additional configuration setting, IgnoreLastChar,
as shown above.
Under normal indexing, words are split on the non-WordCharacters. That
is, you can imagine that every character not listed in WordCharacters is
converted to a space, and what remains is the text tokenized into words.
So since neither the comma nor the apostrophe is in WordCharacters, your
sample:
Gerry O'Reilly, who
gets converted into:
Gerry O Reilly who
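That conversion can be modeled in a few lines of Python. This is just a sketch of the idea, not how swish-e is implemented: split_words is a made-up helper, and the word-character set below (ASCII letters and digits) is an assumption standing in for the real default.

```python
# Assumed stand-in for the WordCharacters setting: ASCII letters and digits only.
DEFAULT_WORD_CHARS = "abcdefghijklmnopqrstuvwxyz0123456789"

def split_words(text, word_chars=DEFAULT_WORD_CHARS):
    """Turn every non-word character into a space, then split on whitespace."""
    text = text.lower()  # case is not important when matching
    spaced = "".join(ch if ch in word_chars else " " for ch in text)
    return spaced.split()

print(split_words("Gerry O'Reilly, who"))
# ['gerry', 'o', 'reilly', 'who']

# With the apostrophe added to the word characters, "Sean's" stays one word,
# so a search for "sean" would no longer match it.
print(split_words("Sean's", DEFAULT_WORD_CHARS + "'"))
# ["sean's"]
```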
But BuzzWords is applied before that step, and it works on
whitespace-delimited words. In that example the whitespace-delimited
"word" is:
O'Reilly,
and that will not match your buzzword setting of O'Reilly. (BTW case is
not important). So to fix that you use the IgnoreLastChar setting, which
removes the trailing comma before checking for a buzzword match.
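The buzzword check can be sketched the same way in Python. Again this is only a model under assumptions, not swish-e's code: find_buzzwords and its parameters are hypothetical names, and the only behavior modeled is "split on whitespace, strip the listed trailing characters, compare case-insensitively".

```python
def find_buzzwords(text, buzzwords, ignore_last_chars=""):
    """Return the whitespace-delimited words that match a buzzword.

    ignore_last_chars plays the role of IgnoreLastChar: those characters
    are stripped from the end of each word before the comparison.
    """
    wanted = {b.lower() for b in buzzwords}
    hits = []
    for raw in text.split():
        word = raw.lower().rstrip(ignore_last_chars)
        if word in wanted:
            hits.append(word)
    return hits

line = "The vendor is Gerry O'Reilly, who owns the"

print(find_buzzwords(line, ["o'reilly"]))        # trailing comma blocks the match
# []
print(find_buzzwords(line, ["o'reilly"], ",."))  # comma stripped first
# ["o'reilly"]
```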
Does all that make sense?
--
Bill Moseley
moseley@hank.org
Received on Fri Aug 8 17:09:49 2003