Brad Miele scribbled on 6/3/05 2:40 PM:
> So it seems like it is stemming the word and then comparing it against the
> stopwords. Does this seem like a correct assessment?
Yes. If stemming is used, the text is first split on whitespace into words,
each word is then stemmed, and the stem is compared against the stopwords list.
The same is true for both indexing and searching.
Example: see how the text is split on whitespace and then the -ing endings are
stripped off. Notice that 'trust' (the stem of 'trusting') is treated as a
stopword and not indexed (...though the output still reports 5 words, which is
misleading... that seems like a bug to me...).
karpet@cartermac 16% swish-e -c c -i f.html -v3 -T parsed_words -T indexed_words parsed_text
Parsing config file 'c'
Indexing Data Source: "File-System"
Indexing "f.html"
Checking file "f.html"...
f.html - Using DEFAULT (HTML2) parser - my hoping and trusting feeling
White-space found word 'my'
Adding:[1:swishdefault(1)] 'my' Pos:5 Stuct:0x9 ( BODY FILE )
White-space found word 'hoping'
Adding:[1:swishdefault(1)] 'hope' Pos:6 Stuct:0x9 ( BODY FILE )
White-space found word 'and'
Adding:[1:swishdefault(1)] 'and' Pos:7 Stuct:0x9 ( BODY FILE )
White-space found word 'trusting'
Adding:[1:swishdefault(1)] 'trust' Pos:8 Stuct:0x9 ( BODY FILE )
White-space found word 'feeling'
Adding:[1:swishdefault(1)] 'feel' Pos:9 Stuct:0x9 ( BODY FILE )
(5 words)
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 5 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
5 unique words indexed.
4 properties sorted.
1 file indexed. 61 total bytes. 5 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
karpet@cartermac 17% cat c
StopWords trust
FuzzyIndexingMode Stemming_en1
karpet@cartermac 18% cat f.html
<html>
<body>
my hoping and trusting feeling
</body>
</html>
karpet@cartermac 19% swish-e -w trust
# SWISH format: 2.5.4
# Search words: trust
# Removed stopwords: trust
err: All search words too common to be useful
.
karpet@cartermac 20% swish-e -T index_words
-----> WORD INFO in index index.swish-e <-----
and [1 1 1 (7/9)]
feel [1 1 1 (9/9)]
hope [1 1 1 (6/9)]
my [1 1 1 (5/9)]
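To restate the order of operations seen in the runs above, here is a rough
sketch in Python. It is not how swish-e is implemented: toy_stem, index_terms
and search_terms are made-up names, and the "stemmer" is just a lookup table
covering the three -ing words in f.html, standing in for whatever Stemming_en1
really does. The only point is that the stopword check happens after stemming,
for both the document text and the query.

```python
# Toy stand-in for the stemmer (assumption): it only knows the three words
# from the f.html example and returns every other word unchanged.
TOY_STEMS = {"hoping": "hope", "trusting": "trust", "feeling": "feel"}


def toy_stem(word):
    """Return the stem for a known example word, else the word itself."""
    return TOY_STEMS.get(word, word)


def index_terms(text, stopwords):
    """Split on whitespace, stem each word, then drop stems that are stopwords."""
    stems = [toy_stem(word) for word in text.split()]
    return [stem for stem in stems if stem not in stopwords]


def search_terms(query, stopwords):
    """Apply the same steps to a query; return (kept stems, removed stopwords)."""
    stems = [toy_stem(word) for word in query.split()]
    kept = [stem for stem in stems if stem not in stopwords]
    removed = [stem for stem in stems if stem in stopwords]
    return kept, removed


stopwords = {"trust"}  # mirrors "StopWords trust" in the config file c

print(sorted(index_terms("my hoping and trusting feeling", stopwords)))
print(search_terms("trusting", stopwords))
```

With the stopword set {"trust"}, index_terms leaves the same four words that
-T index_words lists, and a query for either 'trust' or 'trusting' ends up with
nothing left to search for.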
--
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
Received on Sat Jun 4 12:14:29 2005