It seems that almost every site I try to index somehow zaps the
final index. Example below for http://www.sugarcharity.org/
There is nothing unique about this site; it is just small.
http://www.sugarcharity.org/page3.html
contains an assortment of words that are probably NOT common and
should appear in the index, but do not:
"letter, applicant, pump, financial, income, postal, employer", etc...
Running the indexer produces this file:
ls -l swish.index
-rw-r--r-- 1 spider users 49723 Apr 11 13:16 swish.index
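
Before blaming the indexer, it may be worth confirming that the words
quoted above really are present in the page as served. The following is
only a minimal sketch: check_words is a hypothetical helper (not part of
swish-e), and it assumes a grep that supports -w (GNU and BSD grep do).
It matches raw words, so it does not account for stemming.

```shell
#!/bin/sh
# Hypothetical helper, not part of swish-e: read HTML on stdin and
# report which of the given words occur in it (case-insensitive,
# whole-word match).
check_words() {
    page=$(cat)                       # slurp the page once
    for word in "$@"; do
        if printf '%s\n' "$page" | grep -qiw -- "$word"; then
            echo "found: $word"
        else
            echo "missing: $word"
        fi
    done
}
```

One possible use, assuming curl is installed:
curl -s http://www.sugarcharity.org/page3.html | check_words letter applicant pump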
using
swish-e -V
SWISH-E 2.0
really 2.05
tmp.config contains
IndexFile ./swish.index
MetaNames author description datamodified
IndexReport 3
FollowSymLinks yes
UseStemming yes
PropertyNames author description datamodified
IgnoreTotalWordCountWhenRanking yes
MinWordLimit 4
WordCharacters abcdefghijklmnopqrstuvwxyz0123456789.-_'"
#IgnoreLimit 80 1000
IgnoreWords SwishDefault
IndexComments 0
NoContents .gif .xbm .au .mov .mpg .pdf .ps .jpeg .jpg
MaxDepth 4
Delay 5
command line
swish-e -i http://www.sugarcharity.org -c tmp.config -l -v 3 -S http
The output of this run is at the end of this message,
but.....
/usr/local/bin/swish-e -t HBthe -w "letter" -m 0 -f swish.index
says...
# Swish-e format 2.0
#
# Name: (no name)
# Saved as: swish.index
# Counts: 7 words
# Indexed on: 11/04/2002 13:07:47 PDT
# Description: (no description)
# Pointer: (no pointer)
# Maintained by: (no maintainer)
# DocumentProperties: Enabled
# Stemming Applied: 1
# Soundex Applied: 0
# WordCharacters: '-.0123456789_abcdefghijklmnopqrstuvwxyz
# MinWordLimit: 4
# MaxWordLimit: 40
# BeginCharacters: "&'(0123456789abcdefghijklmnopqrstuvwxyzSO
# EndCharacters: "'),.0123456789\abcdefghijklmnopqrstuvwxyzSO
# IgnoreFirstChar: "'(
# IgnoreLastChar: "'),.;
# SWISH format 2.0
err: the index file(s) is empty
HELLO!!! What is this? Why is the index reported as empty?
This is happening on many sites that indexed successfully in the
past but now produce an index file of the same size as above. It
appears that something date-related has broken.
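
To see whether every query fails the same way, the search shown above
could be repeated for several words. This is a rough sketch only:
search_words and the SWISH_E variable are hypothetical names, the -w, -m
and -f options are the ones used in the command above, and it assumes
that swish-e reports failures on a line beginning with "err:", as in the
output above.

```shell
#!/bin/sh
# Hypothetical wrapper: run the same search for several words and flag
# those for which the output contains a line starting with "err:".
SWISH_E=${SWISH_E:-swish-e}           # binary to run; override if needed

search_words() {
    index=$1; shift                   # first argument is the index file
    for word in "$@"; do
        if "$SWISH_E" -w "$word" -m 0 -f "$index" 2>&1 | grep -q '^err:'; then
            echo "$word: error"
        else
            echo "$word: ok"
        fi
    done
}
```

For example: search_words swish.index letter applicant pump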
results of index operation
Indexing Data Source: "HTTP-Crawler"
Indexing http://www.sugarcharity.org..
retrieving http://www.sugarcharity.org (0)...
(35 words)
retrieving http://www.sugarcharity.org/index.htm (1)...
(35 words)
Skipping ...<snip>
retrieving http://www.sugarcharity.org/page2.html (1)...
(21 words)
Skipping http://www.canadianbutterfly.ca/: Wrong method or server.
retrieving http://www.sugarcharity.org/page3.html (1)...
(132 words)
retrieving http://www.sugarcharity.org/page4.html (1)...
(85 words)
Skipping ... <snip>
http://www.sugarcharity.org/page5.html (1)...
(101 words)
Skipping ... <snip>
retrieving http://www.sugarcharity.org/page6.html (1)...
(98 words)
Removing very common words...
360 words removed.
24 words removed not in common words array:
124, amp, put, 4.0, ne, ha, tax, t4, ag, sex, 495, l7t, 2x5, ask, pat,
zip, dai, 2, 00, p.m, moo, big, box, ad,
Writing main index...
Computing hash table ... Writing header ... Writing index entries ...
Writing stopwords ... no unique words indexed. Writing file index...
Writing file list ... Writing file offsets ... Writing MetaNames ...
Writing offsets (2)... 7 files indexed. Running time: 37 seconds.
Indexing done!
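
The numbers in a log like the one above can be tallied to compare what
was read with what was thrown away. The sketch below is an illustration
only: summarize_log is a hypothetical helper, and it assumes the log was
saved to a file with one "(N words)" line per page and the "N words
removed" lines in the form shown above. Note that the per-page counts
are presumably word occurrences while the removed counts are presumably
distinct words, so the two totals are not directly comparable.

```shell
#!/bin/sh
# Hypothetical helper: summarize a saved swish-e indexing log from stdin.
summarize_log() {
    awk '
        # Per-page lines such as "(35 words)"
        /^[ \t]*\([0-9]+ words\)/ { pages++; sub(/^\(/, "", $1); total += $1 }
        # Removal lines such as "360 words removed."
        /[0-9]+ words removed/    { removed += $1 }
        END { printf "pages=%d words=%d removed=%d\n", pages, total, removed }
    '
}
```

For example: summarize_log < index.log (index.log being whatever file
the output was saved to). For the run above the two removal lines add up
to 384 words, which fits the "no unique words indexed" message.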
Michael@Insulin-Pumpers.org
Received on Thu Apr 11 21:39:19 2002