Re: [SWISH-E:104] Words not getting indexed

From: Giulia Hill <ghill(at)not-real.library.berkeley.EDU>
Date: Tue Jan 06 1998 - 17:17:55 GMT

As I think you had already guessed, the word "cron" is not indexed because
it comes right after a "}" that is not followed by a space. With the
expected "<" and ">" delimiters you would not run into this problem.

Assuming that you do not want to replace "{" with "<" and "}" with ">",
here is a possible solution, along with a warning.

In the config.h file, modify the variable WORDCHARS so that it contains
only the characters that you want to allow in a word, then recompile. The
default file provided with swish-efiles.tar includes pretty much every kind
of character, so you might want to remove some of them. Aside from the
alphabet, you might also want to include 0123456789&#; since otherwise you
will not be able to index HTML entities. Also check the variables
BEGINCHARS and ENDCHARS to make sure that they contain only the characters
that can appear at the beginning and the end of words in your files. With
these changes, pretty much every character that is not included in
WORDCHARS will be treated as a word separator.
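
As a rough illustration of the kind of change I mean, the fragment below
narrows the three variables to letters, digits and the entity characters.
It assumes they are plain #define string macros in config.h and that
lowercase letters are enough; the values are only a guess at what would
suit your files, so compare them with what your copy of config.h really
contains before recompiling.

```c
/* config.h (excerpt): hypothetical narrowed values, not the shipped defaults */

/* Characters allowed inside a word. '{', '}' and '/' are left out on
   purpose so that they act as word separators. */
#define WORDCHARS  "abcdefghijklmnopqrstuvwxyz0123456789&#;"

/* Characters a word may start with ('&' so that entities can begin a word). */
#define BEGINCHARS "abcdefghijklmnopqrstuvwxyz0123456789&"

/* Characters a word may end with (';' so that entities can end a word). */
#define ENDCHARS   "abcdefghijklmnopqrstuvwxyz0123456789;"
```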

Now the warning. Since you are using {} instead of <>, the solution that I
have outlined will also index the words between the curly brackets. For
example, in the file from your mail, the words "title", "ned", "id", and
"type" will also be part of your index. By using <> instead, you can choose
whether the words inside the brackets are indexed, through the config.h
variable INDEXTAGS.
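
To make the effect concrete, here is a minimal sketch of splitting a line
on every character that is not in WORDCHARS. This is not the swish-e
source: split_words, MAXWORDS and MAXWORDLEN are names made up for the
illustration, and lowercasing the input is an assumption. It only shows
why "cron" becomes a word of its own and why the tag names come along
with it.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical narrowed setting, as discussed above. */
#define WORDCHARS  "abcdefghijklmnopqrstuvwxyz0123456789&#;"
#define MAXWORDS   32   /* illustrative limits, not swish-e values */
#define MAXWORDLEN 64

/* Splits text into lowercase words, treating every character outside
   WORDCHARS as a separator. Stores at most max_words words (each cut
   off at MAXWORDLEN - 1 characters) and returns how many were stored. */
static size_t split_words(const char *text, char words[][MAXWORDLEN],
                          size_t max_words)
{
    size_t count = 0;  /* words completed so far */
    size_t len = 0;    /* length of the word being built */
    const char *p;

    assert(text != NULL);
    for (p = text; ; p++) {
        int c = tolower((unsigned char)*p);
        if (*p != '\0' && strchr(WORDCHARS, c) != NULL) {
            /* Word character: append it if there is still room. */
            if (count < max_words && len < MAXWORDLEN - 1)
                words[count][len++] = (char)c;
        } else {
            /* Separator or end of input: close the current word, if any. */
            if (len > 0) {
                words[count][len] = '\0';
                count++;
                len = 0;
            }
            if (*p == '\0')
                break;
        }
    }
    return count;
}
```

Called on the line {title}cron test{/title}, this sketch stores four
words: "title", "cron", "test" and "title" again. So "cron" is found, but
so is the tag name, which is exactly the side effect described above.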

I hope that this helps.

On Mon, 5 Jan 1998, Jeff Morrow wrote:

> Hi there.  I just set up a SWISH-E searchable database on a website in the
> School of Education.  I have an indexing problem.  Indexing the following
> file:
> {title}cron test{/title}
> {NED}Evidence Description File version 1.0, Copyright (c) 1996 KIE Research
> Group{/NED}
> {ID}E01-980102-001{/ID}
> {type}data{/type}
> etc...
> leaves me with an index which does NOT contain the word "cron" anywhere.
> This is problematic, since all the files I'm indexing are in the above
> format.  I'd like for swish-e to treat all non-letter characters as white
> space and index accordingly.  How could I do this?
> Unrelated problem: I've set my search engine to use the AND_RULE as the
> default.  However, in that case, any search containing a stopword will
> automatically produce an empty result, regardless of the other words in the
> query.  I got around this issue by setting IgnoreLimit so high that no
> stopwords exist.  I'd like a better way around this, though.  Any ideas?
> Thanks a lot.  Please respond both to the list and via email.  Thanks.
> Jeff Morrow
