FW: Re: More Trouble with Filters

From: Klingensmith, Rick <klingensmith(at)not-real.hr.msu.edu>
Date: Tue Jul 29 2003 - 13:19:36 GMT

Bill and All,

I'm probably beginning to sound like a flake, but I've got myself very
confused at this point. I've used the following config file and added a bare
use lib line to the swishspider file:

# ----- SiteIndex.config - Spider using "http" method -------
#
#---------------------------------------------------

IncludeConfigFile C:/Swish-E/conf/Settings.config

IndexDir http://localhost

MaxDepth 10

Delay 1

TmpDir C:/Inetpub/Indexes/Temp

SpiderDirectory C:/Swish-E

IndexFile C:/Inetpub/Indexes/SiteIndex.index

# Use the file filter to index pdf files
#FileFilter .pdf c:/SWISH-E/filter-bin/_pdf2html.pl '"%p" -'
#FileFilter .pdf c:/SWISH-E/filter-bin/pdftotext.exe '"%p" -'

# Filter Directory
FilterDir C:/SWISH-E/filter-bin

# end of SiteIndex Config file


Swishspider is in my SWISH-E directory. With this configuration the pdf
files indexed correctly, but I'm still getting the same output for the meta
tags as in my previous post below.


This is the output before the pdf contents:

retrieving http://localhost/affidavit.pdf (1)...
spider 2644 [C:/Inetpub/Indexes/Temp/swishspider@2604
http://localhost/affidavit.pdf]


This is the output after the pdf contents (the contents themselves are the
same as in my previous post below):

Returned 0
 - Using DEFAULT (HTML2) parser -  (279 words)

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 135 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
135 unique words indexed.
4 properties sorted.
2 files indexed.  4412 total bytes.  302 total words.
Elapsed time: 00:00:03 CPU time: 00:00:03
Indexing done!

I thought I was using SWISH::Filter by default, but now I'm not sure.
When I use the FileFilter directive in the config file I get the errors
saying that the pdf is invalid. Once I commented both lines out, at least it
indexed the pdf without error. The FilterDir directive doesn't seem to
matter; I get the same output with or without it. I did confirm that the
document is being indexed: a search for words that only appear in the pdf
returns the correct results.

My perl/site/lib/swish subdirectory contains filter.pm and
perl/site/lib/swish/filters contains the other filter modules. I'm convinced
this is a simple configuration issue, but my perl knowledge is limited so
debugging has been a problem.

Thanks for the help.

Rick
klingensmith@hr.msu.edu


-----Original Message-----
From: Bill Moseley [mailto:moseley@hank.org] 
Sent: Monday, July 28, 2003 5:58 PM
To: Multiple recipients of list
Subject: [SWISH-E] Re: More Trouble with Filters

On Mon, Jul 28, 2003 at 02:19:54PM -0700, Klingensmith, Rick wrote:
> I'm continuing to have a problem with filters. I'm in a windows 2000/XP
> environment and am using the spider to crawl my site which contains pdf
> files. Pdfinfo and pdftotext are installed and working from the command
> line. 

That's a good thing to know.

> For each pdf file indexed I receive the following error:

> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> 
> Error: Couldn't find trailer dictionary
> 
> Error: Couldn't read xref table

Those are all messages coming from xpdf.  So the next step is to modify 
whatever is calling pdfinfo/pdftotext and see how it's being called.

> I modified swishspider at line 144 to print the contents to stderr and
> receive the following output for the meta tags for the document. As you can
> see below I believe the meta tags from the output from pdfinfo are not being
> formed properly. I just can't figure out why.

> <html>
> 
> <head>
> 
> ">eta name="author" content="jamin
>
> ">eta name="creationdate" content="04/23/03 10:40:15
>
> ">eta name="creator" content="Affidavit final.doc - Microsoft Word
>
> ">eta name="encrypted" content="no
>
> ">eta name="file_size" content="31838 bytes
>
> ">eta name="moddate" content="04/23/03 10:47:36

That's weird output.  Looks like it's dropping some characters and 
there's an extra blank line.  Maybe DOS line endings are causing a 
problem?
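
To illustrate the line-ending guess, here is a minimal Python sketch (not the
actual SWISH::Filter code, which is Perl). It assumes a pdfinfo-style
"Key: value" line that still carries a DOS carriage return when it is wrapped
in a meta tag. The helper names meta_tag and render_on_terminal and the sample
line are made up for the illustration. If that assumption holds, a terminal
would display the tag much like the mangled lines quoted above:

```python
def meta_tag(name, value):
    """Wrap one field in an HTML meta tag (hypothetical helper)."""
    return '<meta name="%s" content="%s">' % (name, value)


def render_on_terminal(line):
    """Simulate a terminal: a carriage return moves the cursor back to
    column 0, so the text that follows overwrites the start of the line."""
    screen = []
    col = 0
    for ch in line:
        if ch == "\r":
            col = 0
            continue
        if col < len(screen):
            screen[col] = ch
        else:
            screen.append(ch)
        col += 1
    return "".join(screen)


# Assumed sample: one pdfinfo-style line with a DOS (CRLF) line ending.
raw = "Author: jamin\r\n"

# Only the "\n" is removed, so the value still ends in a carriage return.
key, value = raw.rstrip("\n").split(":", 1)
broken = meta_tag(key.lower(), value.strip(" "))

# strip() with no argument removes all surrounding whitespace, "\r" included.
fixed = meta_tag(key.lower(), value.strip())

print(render_on_terminal(broken))
print(fixed)
```

If the real output contains carriage returns, stripping them from each value
before the tag is built should give well-formed tags. Whether that is what
happens inside the filter on Windows is not confirmed here.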

Hum, ok so you are using -S http with swishspider.  Are you using the 
SWISH::Filter module(s) to decode the pdf?  Or are you using a 
FileFilter directive (although I'm not sure that works)?

With the SWISH::Filter setup, I just added a use lib line to the 
swishspider file so it could find the modules, and ran:

moseley@bumby:~/apache$ ./swishspider swish http://localhost/apache/test.pdf

moseley@bumby:~/apache$ head swish.contents
<html>    
<head>
<meta name="author" content=" ">
<meta name="creationdate" content="Fri Mar 21 21:42:23 2003">
<meta name="creator" content="Microsoft Word: AdobePS 8.7.3 (301)">
<meta name="encrypted" content="no">
<meta name="file_size" content="32194 bytes">
<meta name="moddate" content="Fri Mar 21 21:42:23 2003">
<meta name="optimized" content="yes">
<meta name="page_size" content="612 x 792 pts (letter)">


Can you duplicate that under Windows?



-- 
Bill Moseley
moseley@hank.org
Received on Tue Jul 29 13:20:05 2003