RE: WordCharacters

From: David Norris <kg9ae(at)not-real.geocities.com>
Date: Sat Aug 28 1999 - 02:23:58 GMT
> WordCharacters  abcdefghijklmnopqrstuvwxyz0123456789_-
> BeginCharacters abcdefghijklmnopqrstuvwxyz0123456789_-
> EndCharacters   abcdefghijklmnopqrstuvwxyz0123456789_-
> IgnoreLastChar  )]}'.,;?!_-"
> IgnoreFirstChar ([{'_-"

For example, the following bits of text wouldn't be indexed with the above
config:
Jimbob's
exposé or expos&eacute; or expos&#233;
Scary!
Really?
Sadly.

You need to have &, #, ;, a-z, and 0-9 in your WordCharacters to catch the
SGML entities and various other character codes.  @ would probably be a good
idea if you plan to index email addresses and such.  - and _ would be good as
well, along with $ and % for various string- and number-related things.  These
might be useful for certain types of data: "|,'"[](~!^{}+?".  Math and
science texts, for instance, contain such characters in word-like structures
that are worth searching.
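
To illustrate why the entity characters matter, here is a small Python
sketch.  It is only a simplified model that assumes words are maximal runs
of WordCharacters; it is not Swish-e's actual parser, and the names
split_words, ORIGINAL and EXTENDED are made up for the example.

```python
import string

def split_words(text, word_chars):
    """Split text into maximal runs of characters drawn from word_chars."""
    words, current = [], []
    for ch in text.lower():
        if ch in word_chars:
            current.append(ch)
        elif current:
            # A non-word character ends the current run.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words

# The WordCharacters value from the quoted config.
ORIGINAL = set(string.ascii_lowercase + string.digits + "_-")
# Hypothetical extended set with the characters suggested above.
EXTENDED = ORIGINAL | set("&#;@$%")

sample = "expos&eacute; or expos&#233;"
print(split_words(sample, ORIGINAL))
print(split_words(sample, EXTENDED))
```

Under this model the original set breaks each entity apart at & and ;,
while the extended set keeps "expos&eacute;" and "expos&#233;" whole.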

> enabled.  First, wild card search words (words with '*' at the end) do not
> seem to get stemmed, so you don't get the results you would expect.

What behavior would you expect with wildcards?  The stemming algorithm
depends on the word ending in letters, and stemming shouldn't be needed
with a wildcard in place.  The algorithm can't stem a word that ends in an
arbitrary run of characters of any length.  Besides, the words matched by
the wildcard would likely include the stem anyway, depending on the wildcard
placement and the word.

> Second, any query that includes punctuation characters seems to just search
> for that exact term, even if the characters are not part of the
> WordCharacters settings.  Seems like the same rules should apply to the
> search string fed into Swish as to the words that are indexed.

You need to thoroughly read the config file comments.
http://sunsite.berkeley.edu/SWISH-E/Manual/config.user.html
In summary:
	1. IgnoreLastChar must be a subset of EndCharacters
	2. IgnoreFirstChar must be a subset of BeginCharacters
	3. BeginCharacters and EndCharacters must be subsets of WordCharacters.

Otherwise it will index words incorrectly.  According to your configuration,
the string "this is an example." contains four words: "this", "is", "an",
and "example."  Notice that the last word has a punctuation mark at the end.
It would be indexed and searched as such.  A search for "example" would not
return "example." and vice versa.
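
The three subset rules above can be checked mechanically.  The following
Python sketch is not part of Swish-e; check_subsets and quoted_config are
hypothetical names, and the dictionary simply restates the settings quoted
at the top of this message.

```python
import string

def check_subsets(config):
    """Return {violated rule: sorted offending characters}."""
    rules = [
        ("IgnoreLastChar", "EndCharacters"),
        ("IgnoreFirstChar", "BeginCharacters"),
        ("BeginCharacters", "WordCharacters"),
        ("EndCharacters", "WordCharacters"),
    ]
    problems = {}
    for inner, outer in rules:
        # Characters listed in the inner setting but missing from the outer one.
        extra = set(config[inner]) - set(config[outer])
        if extra:
            problems[f"{inner} not within {outer}"] = sorted(extra)
    return problems

letters_digits = string.ascii_lowercase + string.digits + "_-"
quoted_config = {
    "WordCharacters": letters_digits,
    "BeginCharacters": letters_digits,
    "EndCharacters": letters_digits,
    "IgnoreLastChar": ")]}'.,;?!_-\"",
    "IgnoreFirstChar": "([{'_-\"",
}

for rule, chars in check_subsets(quoted_config).items():
    print(rule, "->", "".join(chars))
```

Run against the quoted settings, this flags the punctuation in
IgnoreLastChar and IgnoreFirstChar that is missing from EndCharacters and
BeginCharacters, while rule 3 passes.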

,David Norris

World Wide Web - http://www.webaugur.com/dave
Page via mail - 412039@pager.mirabilis.com
ICQ Universal Internet Number - 412039
E-Mail - dave@webaugur.com
Received on Fri Aug 27 19:16:00 1999