C:\SWISH-E>swish-e -S http -c conf/siteindex.config -v 3
Parsing config file 'conf/siteindex.config'
Parsing config file 'C:/Swish-E/conf/Settings.config'
Warning: Configuration setting for TmpDir 'C:/Inetpub/Indexes/Temp' will be overridden by environment setting 'C:\DOCUME~1\klingen2\LOCALS~1\Temp'
Indexing Data Source: "HTTP-Crawler"
Indexing "http://localhost"
Returned 0
retrieving http://localhost (0)...
Returned 0
- Using DEFAULT (HTML2) parser - (23 words)
retrieving http://localhost/affidavit.pdf (1)...
Returned 0
- Using DEFAULT (HTML2) parser - Error: Couldn't open file 'C:\Inetpub\Indexes\Temp\swishspider@3840.contents-'
c:\SWISH-E\filter-bin\_pdf2html.pl: Failed close on pipe to pdfinfo for C:\Inetpub\Indexes\Temp\swishspider@3840.contents-: 256 at c:\SWISH-E\filter-bin\_pdf2html.pl line 54.
(no words indexed)
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 19 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
19 unique words indexed.
4 properties sorted.
2 files indexed. 4335 total bytes. 23 total words.
Elapsed time: 00:00:03 CPU time: 00:00:03
Indexing done!
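The TmpDir warning in the output above suggests that the directory named in the config and the one swish-e actually uses may differ, while the filter error still refers to the configured path. A minimal sketch for checking both locations follows. It assumes Python is available on the machine and that the override comes from one of the TMPDIR, TMP, or TEMP environment variables (I have not confirmed which one swish-e consults). The function name `temp_dir_report` is made up for illustration.

```python
import os


def temp_dir_report(configured, env=None):
    """Summarize the configured temp dir and any environment overrides.

    Assumption: the override reported by swish-e comes from TMPDIR, TMP,
    or TEMP. This helper is hypothetical and not part of swish-e.
    """
    env = os.environ if env is None else env
    # Collect only the variables that are set to a non-empty value.
    overrides = {
        name: env[name] for name in ("TMPDIR", "TMP", "TEMP") if env.get(name)
    }
    return {
        "configured": configured,
        "configured_exists": os.path.isdir(configured),
        "configured_writable": os.access(configured, os.W_OK),
        "env_overrides": overrides,
    }


if __name__ == "__main__":
    # Path taken from the TmpDir line in the config file.
    for key, value in temp_dir_report("C:/Inetpub/Indexes/Temp").items():
        print(f"{key}: {value}")
```

If the configured directory is missing or not writable, or if an environment variable points somewhere else, that would at least narrow down where the spider's temporary file ends up.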
My configuration file looks like this:
# Include our site-wide configuration settings:
#IncludeConfigFile D:/ProgramFiles/Swish-E/conf/Settings.config
IncludeConfigFile C:/Swish-E/conf/Settings.config
# Specify the URL (or URLs) to index:
#IndexDir http://www.hr.msu.edu/hrsite
IndexDir http://localhost
# If a server goes by more than one name you can use this directive:
# EquivalentServer http://swish-e.org http://www.swish-e.org
MaxDepth 10
# The number of seconds to wait between issuing
# requests to a server. The default is 60 seconds.
Delay 1
TmpDir C:/Inetpub/Indexes/Temp
# The "http" method uses a perl helper program to fetch each document
# from the web called "swishspider" and is included in the src directory of
# the swish-e distribution.
SpiderDirectory C:/Swish-E
# Put the index files in the Inetpub/Indexes directory
#IndexFile D:/Inetpub/Indexes/SiteIndex.New.index
IndexFile C:/Inetpub/Indexes/SiteIndex.index
# Use the file filter to index pdf files
#FileFilter .pdf c:/SWISH-E/filter-bin/_pdf2html.pl "'%p' -"
FileFilter .pdf c:/SWISH-E/filter-bin/_pdf2html.pl
FileFilter .pdf c:/SWISH-E/filter-bin/pdftotext.exe "'%p'"
# Filter Directory
#FilterDir C:/SWISH-E/filters
# end of SiteIndex Config file
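The config above has two active FileFilter lines for .pdf, and I am not sure how swish-e treats two filters for the same extension. A small sketch for flagging such duplicates follows. It assumes each directive sits on one line as `FileFilter <extension> <program> [options]` and that directive names are case-insensitive. The function name `duplicate_filters` is hypothetical.

```python
from collections import defaultdict


def duplicate_filters(config_text):
    """Return {extension: [programs]} for extensions with more than one
    active FileFilter line. Commented-out lines are skipped because their
    first token is '#FileFilter' (or '#'), not 'FileFilter'.
    """
    seen = defaultdict(list)
    for line in config_text.splitlines():
        parts = line.split()
        # Assumption: directive names are matched case-insensitively.
        if len(parts) >= 3 and parts[0].lower() == "filefilter":
            seen[parts[1].lower()].append(parts[2])
    return {ext: progs for ext, progs in seen.items() if len(progs) > 1}


if __name__ == "__main__":
    # Path taken from the command line shown in the log (relative to C:\SWISH-E).
    with open("conf/siteindex.config") as handle:
        print(duplicate_filters(handle.read()))
```

Run against the file above, this should report .pdf with both `_pdf2html.pl` and `pdftotext.exe`.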
My Settings.config file looks like this:
# These settings tell swish what defines a word.
WordCharacters abcdefghijklmnopqrstuvwxyz0123456789.-
IgnoreFirstChar .-
IgnoreLastChar .-
# Finally, resulting words must begin/end with one
# of the characters listed here
BeginCharacters abcdefghijklmnopqrstuvwxyz0123456789
EndCharacters abcdefghijklmnopqrstuvwxyz0123456789
# Turn this on for a slight performance improvement
#FollowSymLinks yes
IndexReport 2
#IgnoreWords file: D:/ProgramFiles/Swish-E/conf/stopwords/english.txt
IgnoreWords file: C:/Swish-E/conf/stopwords/english.txt
TranslateCharacters :ascii7:
BumpPositionCounterCharacters |.
As you can see, the setup is pretty standard, and all the HTML pages on my
site are indexed with no problem; only the PDF file fails. I have ActiveState
Perl installed on the system.
Any ideas where I've gone wrong?
Rick
Richard Klingensmith
MSU Human Resources Information Systems
1407 S. Harrison Road Ste. 40
East Lansing, MI 48823
(517) 432-4636 ext. 155
klingensmith@hr.msu.edu
Received on Fri Jul 25 13:27:27 2003