Try something simple first.
> cat c
propertynames foo description
> cat 1.html
<meta name="foo" content="bar">
hello
<META NAME="DESCRIPTION"
CONTENT="UK based
Wine Shop">
> ./swish-e -c c -i 1.html -T indexed_words properties
Indexing Data Source: "File-System"
Indexing "1.html"
Adding:[swishdefault:1] 'bar' Pos:2 Stuct:0x81 ( META FILE )
Adding:[swishdefault:1] 'hello' Pos:4 Stuct:0x1 ( FILE )
Adding:[swishdefault:1] 'uk' Pos:6 Stuct:0x81 ( META FILE )
Adding:[swishdefault:1] 'based' Pos:7 Stuct:0x81 ( META FILE )
Adding:[swishdefault:1] 'wine' Pos:8 Stuct:0x81 ( META FILE )
Adding:[swishdefault:1] 'shop' Pos:9 Stuct:0x81 ( META FILE )
swishdocpath: 6 ( 6) S: "1.html"
swishdocsize: 8 ( 4) N: "0000000000110"
swishlastmodified: 9 ( 4) D: "2001-11-16 10:34:12"
foo:10 ( 3) S: "bar"
description:11 ( 30) S: "UK based Wine Shop"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 6 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
6 unique words indexed.
6 properties sorted.
1 file indexed. 110 total bytes.
Elapsed time: 00:00:00 CPU time: 00:00:00
Indexing done!
> ./swish-e -w not dkdk -p foo description
# SWISH format: 2.1-dev-24
# Search words: not dkdk
# Number of hits: 1
# Search time: 0.000 seconds
# Run time: 0.005 seconds
1000 1.html "1.html" 110 "bar" "UK based Wine Shop"
.
I've always wondered what, if anything, should be done with the extra
white space in the description property (note the stored length of 30
for 18 visible characters).
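If you would rather clean that up on the client side, here is a minimal
sketch in Python. It is not something swish-e does itself. The function
names are made up for illustration, and the field order (rank, path,
title, size, then the -p properties) is assumed from the result line
above. It also assumes quoted fields contain no embedded double quotes
or backslashes.

```python
import re
import shlex


def collapse_whitespace(value: str) -> str:
    """Replace each run of spaces, tabs, or newlines with a single space."""
    return re.sub(r"\s+", " ", value).strip()


def parse_result_line(line: str) -> dict:
    """Split one result line into rank, path, title, size, and properties.

    Hypothetical helper: the field order is assumed from the sample
    output above, not taken from the swish-e documentation.
    """
    rank, path, title, size, *props = shlex.split(line)
    return {
        "rank": int(rank),
        "path": path,
        "title": title,
        "size": int(size),
        # Properties may carry line breaks and padding from the HTML source.
        "properties": [collapse_whitespace(p) for p in props],
    }


if __name__ == "__main__":
    # The CONTENT attribute from 1.html, line break included.
    print(collapse_whitespace("UK based\nWine Shop"))
    print(parse_result_line('1000 1.html "1.html" 110 "bar" "UK based\nWine Shop"'))
```

Whether the index should store the cleaned value instead is a separate
question; this only tidies what comes back from a search.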
>TmpDir /tmp/
>SpiderDirectory ./
>Delay 0
>MaxDepth 1
Use -S prog with spider.pl for a faster spider.
>IgnoreLimit 80 1000
Don't use IgnoreLimit (see the 2.1-dev docs) more than once ;)
>IndexComments 0
>IndexContents HTML .lml .htm .html
If you are parsing HTML, consider using the libxml2 parser. It's more
accurate.
>IgnoreWords File: swish-stopwords.txt
I'm starting to think stopwords are bad, in general. My list is about five
words long.
>IndexDir http://www.bbr.com/gb.lml
>IndexFile index.tmp
Bill Moseley
mailto:moseley@hank.org
Received on Fri Nov 16 18:56:37 2001