Re: Stopwords/Stemming

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Sat Jun 04 2005 - 19:14:17 GMT
Brad Miele scribbled on 6/3/05 2:40 PM:

> So it seems like it is stemming the word and then comparing it against the
> stopwords. Does this seem like a correct assessment?

Yes. If stemming is used, the text is first split on whitespace into words,
each word is stemmed, and the stem is then compared against the stopwords
list. The same order applies to both indexing and searching.
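
The order of operations can be sketched in a few lines of Python. This is only
an illustration, not swish-e's code: toy_stem() and its lookup table are
hypothetical stand-ins for the real stemmer, with the stems copied from the
words used in this message, and index_terms() is a made-up helper name.

```python
# Hypothetical stand-in for the real stemmer: a lookup table holding only
# the stems of the words used in this message.
TOY_STEMS = {"hoping": "hope", "trusting": "trust", "feeling": "feel"}


def toy_stem(word):
    """Return the toy stem for a word, or the word itself if unknown."""
    return TOY_STEMS.get(word, word)


def index_terms(text, stopwords):
    """Split on whitespace, stem each word, then drop stems that are stopwords."""
    terms = []
    for word in text.split():
        stem = toy_stem(word)
        # The stopword check happens on the stem, not on the original word.
        if stem not in stopwords:
            terms.append(stem)
    return terms


stopwords = {"trust"}

# Indexing: 'trusting' stems to 'trust', which is a stopword, so it is dropped.
indexed = index_terms("my hoping and trusting feeling", stopwords)
print(sorted(indexed))

# Searching goes through the same steps, so a query for 'trust' ends up empty.
query = index_terms("trust", stopwords)
print(query)
```

Because the comparison is made against the stem, a stopword entry such as
'trust' also removes every word that stems to it.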

Here is an example. The text is split on whitespace, then the -ing endings are
stripped off. Notice that 'trust' (the stem of 'trusting') is treated as a
stopword and not indexed, even though the summary still reports 5 words. The
word count is misleading; that seems like a bug to me.

karpet@cartermac 16% swish-e -c c -i f.html -v3 -T parsed_words -T indexed_words parsed_text
Parsing config file 'c'
Indexing Data Source: "File-System"
Indexing "f.html"

Checking file "f.html"...
   f.html - Using DEFAULT (HTML2) parser - my hoping and trusting feeling
White-space found word 'my'
     Adding:[1:swishdefault(1)]   'my'   Pos:5  Stuct:0x9 ( BODY FILE )
White-space found word 'hoping'
     Adding:[1:swishdefault(1)]   'hope'   Pos:6  Stuct:0x9 ( BODY FILE )
White-space found word 'and'
     Adding:[1:swishdefault(1)]   'and'   Pos:7  Stuct:0x9 ( BODY FILE )
White-space found word 'trusting'
     Adding:[1:swishdefault(1)]   'trust'   Pos:8  Stuct:0x9 ( BODY FILE )
White-space found word 'feeling'
     Adding:[1:swishdefault(1)]   'feel'   Pos:9  Stuct:0x9 ( BODY FILE )
  (5 words)

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 5 words alphabetically
Writing header ...
Writing index entries ...
   Writing word text: Complete
   Writing word hash: Complete
   Writing word data: Complete
5 unique words indexed.
4 properties sorted.
1 file indexed.  61 total bytes.  5 total words.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!

karpet@cartermac 17% cat c
StopWords trust
FuzzyIndexingMode Stemming_en1

karpet@cartermac 18% cat f.html
<html>
<body>
my hoping and trusting feeling
</body>
</html>

karpet@cartermac 19% swish-e -w trust
# SWISH format: 2.5.4
# Search words: trust
# Removed stopwords: trust
err: All search words too common to be useful
.
karpet@cartermac 20% swish-e -T index_words

-----> WORD INFO in index index.swish-e <-----

and [1 1 1 (7/9)]

feel [1 1 1 (9/9)]

hope [1 1 1 (6/9)]

my [1 1 1 (5/9)]



-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Sat Jun 4 12:14:29 2005