Indexing pdf files

From: Klingensmith, Rick <klingensmith(at)not-real.hr.msu.edu>
Date: Fri Jul 25 2003 - 13:27:16 GMT
C:\SWISH-E>swish-e -S http -c conf/siteindex.config -v 3

Parsing config file 'conf/siteindex.config'

Parsing config file 'C:/Swish-E/conf/Settings.config'

 

Warning: Configuration setting for TmpDir 'C:/Inetpub/Indexes/Temp' will be overridden by environment setting 'C:\DOCUME~1\klingen2\LOCALS~1\Temp'

Indexing Data Source: "HTTP-Crawler"

Indexing "http://localhost"

Returned 0

retrieving http://localhost (0)...

Returned 0

 - Using DEFAULT (HTML2) parser -  (23 words)

retrieving http://localhost/affidavit.pdf (1)...

Returned 0

 - Using DEFAULT (HTML2) parser - Error: Couldn't open file 'C:\Inetpub\Indexes\Temp\swishspider@3840.contents-'

c:\SWISH-E\filter-bin\_pdf2html.pl: Failed close on pipe to pdfinfo for C:\Inetpub\Indexes\Temp\swishspider@3840.contents-: 256 at c:\SWISH-E\filter-bin\_pdf2html.pl line 54.

 (no words indexed)

 

Removing very common words...

no words removed.

Writing main index...

Sorting words ...

Sorting 19 words alphabetically

Writing header ...

Writing index entries ...

  Writing word text: Complete

  Writing word hash: Complete

  Writing word data: Complete

19 unique words indexed.

4 properties sorted.

2 files indexed.  4335 total bytes.  23 total words.

Elapsed time: 00:00:03 CPU time: 00:00:03

Indexing done!

 

 

My configuration file looks like this:

 

# Include our site-wide configuration settings:

 

#IncludeConfigFile D:/ProgramFiles/Swish-E/conf/Settings.config

IncludeConfigFile C:/Swish-E/conf/Settings.config

 

# Specify the URL (or URLs) to index:

#IndexDir http://www.hr.msu.edu/hrsite

IndexDir http://localhost

 

# If a server goes by more than one name you can use this directive:

 

# EquivalentServer http://swish-e.org  http://www.swish-e.org

 

 

MaxDepth 10

 

# The number of seconds to wait between issuing

# requests to a server.  The default is 60 seconds.

 

Delay 1

 

TmpDir C:/Inetpub/Indexes/Temp

 

# The "http" method uses a perl helper program to fetch each document

# from the web called "swishspider" and is included in the src directory of

# the swish-e distribution.

 

SpiderDirectory C:/Swish-E

 

# Put the index files in the Inetpub/Indexes directory

#IndexFile D:/Inetpub/Indexes/SiteIndex.New.index

IndexFile C:/Inetpub/Indexes/SiteIndex.index

 

# Use the file filter to index pdf files

#FileFilter .pdf c:/SWISH-E/filter-bin/_pdf2html.pl "'%p' -"

FileFilter .pdf c:/SWISH-E/filter-bin/_pdf2html.pl

FileFilter .pdf c:/SWISH-E/filter-bin/pdftotext.exe "'%p'"

 

# Filter Directory

#FilterDir C:/SWISH-E/filters

 

# end of SiteIndex Config file

 

 

My Settings.config file looks like this:

 

# These settings tell swish what defines a word.

 

WordCharacters abcdefghijklmnopqrstuvwxyz0123456789.-

 

IgnoreFirstChar .-

IgnoreLastChar  .-

 

# Finally, resulting words must begin/end with one

# of the characters listed here

 

BeginCharacters abcdefghijklmnopqrstuvwxyz0123456789

EndCharacters   abcdefghijklmnopqrstuvwxyz0123456789

 

# Turn this on for a slight performance improvement

#FollowSymLinks yes

 

IndexReport 2

 

#IgnoreWords file: D:/ProgramFiles/Swish-E/conf/stopwords/english.txt

IgnoreWords file: C:/Swish-E/conf/stopwords/english.txt

 

TranslateCharacters :ascii7:

 

BumpPositionCounterCharacters |.

 

As you can see, it's pretty standard, and all the HTML pages on my site are
indexed with no problem. I have ActiveState Perl installed on the system.
Any ideas where I've gone wrong?
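
One way to narrow this down, sketched in Python rather than Perl: the "256" in the pdfinfo error above is most likely a raw wait status (what Perl reports in $? after closing a pipe), where the exit code sits in the high byte. Decoding it gives exit code 1, which the Xpdf tools use, as far as I know, for an error opening the PDF file. That would match the "Couldn't open file" message. The helper names below (decode_wait_status, missing_tools) are hypothetical and not part of swish-e. The PATH check assumes the filter script calls pdfinfo and pdftotext by bare name.

```python
import shutil


def decode_wait_status(status):
    """Split a raw wait status (as in Perl's $?) into (exit_code, signal_number)."""
    # High byte holds the exit code; low 7 bits hold the terminating signal.
    return status >> 8, status & 0x7F


def missing_tools(names):
    """Return the helper programs from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # 256 is the value reported in the "Failed close on pipe to pdfinfo" error.
    exit_code, signal_number = decode_wait_status(256)
    print("exit code %d, signal %d" % (exit_code, signal_number))

    # Assumption: the filter invokes these programs by bare name.
    print("not found on PATH:", missing_tools(["pdfinfo", "pdftotext"]))
```

If both tools are found, the next thing to look at would be the file name handed to the filter, since the path in the error message ends in "contents-".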

 

Rick

 

Richard Klingensmith

MSU Human Resources Information Systems

1407 S. Harrison Road Ste. 40

East Lansing, MI 48823

(517) 432-4636 ext. 155

klingensmith@hr.msu.edu

 




*********************************************************************
Due to deletion of content types excluded from this list by policy,
this multipart message was reduced to a single part, and from there
to a plain text message.
*********************************************************************
Received on Fri Jul 25 13:27:27 2003