Re: searching for words with an Apostrophe

From: <moseley(at)not-real.hank.org>
Date: Fri Aug 08 2003 - 17:09:28 GMT
On Fri, Aug 08, 2003 at 03:18:19AM -0700, Sean Downey wrote:
> Hello
> 
> Is it possible to search for words with an apostrophe??
> e.g. o'reilly
> 
> I did a search for -O'Reilly- but it said No Results.
> When I search for Reilly it returned 66 results one of which was
> "The vendor is Gerry O'Reilly, who owns the ........."

Are you sure about that?

moseley@bumby:~$ cat 1.txt
The vendor is Gerry O'Reilly, who owns the

moseley@bumby:~$ swish-e -i 1.txt -v0

moseley@bumby:~$ swish-e -w "o'reilly" -H0
1000 1.txt "1.txt" 43

moseley@bumby:~$ swish-e -w "o reilly" -H0
1000 1.txt "1.txt" 43

moseley@bumby:~$ swish-e -w "reilly" -H0
1000 1.txt "1.txt" 43

Since the apostrophe is not part of the "WordCharacters" setting, it's 
indexing O'Reilly as *two* words.  So this also happens:

moseley@bumby:~$ swish-e -w "reilly o" -H0
1000 1.txt "1.txt" 43

In all cases above you are searching for "reilly AND o".

moseley@bumby:~$ swish-e -w "reilly o" -H9 | grep -i parsed
# Parsed Words: reilly o 

moseley@bumby:~$ swish-e -w "reilly'o" -H9 | grep -i parsed
# Parsed Words: reilly o 

You can add the apostrophe to WordCharacters, but then you have to worry 
about things like "Sean's" not being found by a search for "Sean".
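
To make that trade-off concrete, here is a toy word splitter built from
plain shell tools.  It is not swish-e, just a sketch: split_words is a
made-up helper, and its first argument stands in for WordCharacters.

```shell
# Toy word splitter, NOT swish-e.  $1 = characters allowed inside a word
# (a stand-in for WordCharacters), $2 = the text to split.
split_words() {
    printf '%s\n' "$2" \
        | tr '[:upper:]' '[:lower:]' \
        | tr -cs "$1" '\n'      # every other character ends the word
}

# Apostrophe not a word character: prints sean, s, book (one per line),
# so a search for "sean" has something to match.
split_words "a-z0-9" "Sean's book"

# Apostrophe added: prints sean's, book.  There is no plain "sean" word.
split_words "a-z0-9'" "Sean's book"
```

With the apostrophe in the set, "sean's" is a single word, which is why a
search for "Sean" would miss it.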

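If you want to make sure you find just the "O'Reilly" entries (not 
documents that merely contain Reilly and the letter o somewhere), you 
can use a phrase search to force swish to find the word "o" directly 
before the word "reilly".  Phrases are delimited with double quotes, 
which must be protected from the shell: swish-e -w '"o reilly"'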

It would be nice to strip off those 's (as in Sean's), but sometimes you 
might want to search only for that form.

Another option, instead of adding the apostrophe to WordCharacters, 
would be to add O'Reilly to a BuzzWords list.  But see the note that 
follows:

moseley@bumby:~$ cat c
BuzzWords o'reilly
IgnoreLastChar ,.

moseley@bumby:~$ swish-e -i 1.txt -c c -v0

moseley@bumby:~$ swish-e -w "O'Reilly"    
# SWISH format: 2.4.0-pr1
# Search words: O'Reilly
# Removed stopwords: 
# Number of hits: 1
# Search time: 0.001 seconds
# Run time: 0.071 seconds
1000 1.txt "1.txt" 13
.

moseley@bumby:~$ swish-e -w "O Reilly"    
# SWISH format: 2.4.0-pr1
# Search words: O Reilly
# Removed stopwords: 
err: no results
.

moseley@bumby:~$ swish-e -w "Reilly"
# SWISH format: 2.4.0-pr1
# Search words: Reilly
# Removed stopwords: 
err: no results
.

Now, this probably needs to be documented better (I didn't look), but 
that requires an additional configuration setting, IgnoreLastChar, as 
shown above.

Under normal indexing, words are split on the non-WordCharacters -- that 
is, you can imagine that every character not listed in WordCharacters is 
converted to a space, leaving text that is already tokenized into words.

So, since neither the comma nor the apostrophe is in WordCharacters, 
your sample:
 
  Gerry O'Reilly, who 

gets converted into:

  Gerry O Reilly who 
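
As a rough illustration of that conversion, using plain shell tools and
not swish-e itself (tokenize is a made-up helper, and I'm assuming
WordCharacters is just a-z and 0-9 and that words are lowercased):

```shell
# Imitates "turn every non-word character into a space"; NOT swish-e code.
# Assumption: WordCharacters is only a-z and 0-9.
tokenize() {
    printf '%s\n' "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | tr -c 'a-z0-9\n' ' ' \
        | tr -s ' '             # collapse runs of spaces
}

tokenize "Gerry O'Reilly, who"  # prints: gerry o reilly who
```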

But BuzzWords matching happens before that step -- it works on 
whitespace-delimited words.  In that example the whitespace-delimited 
"word" is:

    O'Reilly,

and that will not match your buzzword setting of O'Reilly.  (BTW, case 
is not important.)  So to fix that you use the IgnoreLastChar setting, 
which removes the comma before checking for a buzzword match.
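
The order of those steps can be sketched in shell.  This is only a toy
model of what I described, not how swish-e is implemented: index_words
is a made-up helper, and it assumes one buzzword, IgnoreLastChar set to
",." and WordCharacters of a-z and 0-9.

```shell
# Toy model of the indexing order described above; NOT swish-e code.
# Assumptions: a single buzzword, IgnoreLastChar of ",.", and
# WordCharacters limited to a-z and 0-9.
BUZZWORD="o'reilly"

index_words() {
    for raw in $1; do                   # 1: split on whitespace only
        word=$(printf '%s' "$raw" | tr '[:upper:]' '[:lower:]')
        while :; do                     # 2: strip trailing "," and "."
            case "$word" in
                *[,.]) word=${word%?} ;;
                *)     break ;;
            esac
        done
        if [ "$word" = "$BUZZWORD" ]; then
            printf '%s\n' "$word"       # 3: buzzword is kept whole
        else
            # 4: otherwise split on non-word characters as usual
            for part in $(printf '%s' "$word" | tr -c 'a-z0-9' ' '); do
                printf '%s\n' "$part"
            done
        fi
    done
}

index_words "Gerry O'Reilly, who"       # prints gerry, o'reilly, who
```

Without step 2 the comparison in step 3 would see "o'reilly," and fail,
so the word would fall through to the normal splitting in step 4.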

Does all that make sense?


-- 
Bill Moseley
moseley@hank.org
Received on Fri Aug 8 17:09:49 2003