Problems with FileFilter .pdf

From: Gerald Klaas <gklaas(at)not-real.arb.ca.gov>
Date: Thu Sep 20 2001 - 21:59:02 GMT

I'm having problems getting the pdf filter going.

I have SWISH-E 2.0 running on RedHat Linux 6.2.
I'm creating an index using -S http to spider
a single .pdf file (just to test the filter).

From my pdftest.config file:
---snip---
IndexDir http://www.arb.ca.gov/msprog/spillcon/wdec00.pdf
FilterDir   /app/swish/filter-bin/ 
FileFilter   .pdf   pdf-filter.sh 
FileFilter   .doc   doc-filter.sh 
---end snip---

From /app/swish/filter-bin/pdf-filter.sh:
---snip---
#!/bin/sh 
# Adobe PDF filter 
# see: http://www.foolabs.com/xpdf/  
/usr/bin/pdftotext "$1" - 2>/dev/null 
/usr/bin/pdftotext "$1" - >/tmp/gersee 
---end snip---

When I run the indexer, it indexes only 2 words:
---snip---
[swish@listserv adamtest]$ ../src/swish-e -S http -c pdftest.config 
Indexing Data Source: "HTTP-Crawler" 
Indexing http://www.arb.ca.gov/msprog/spillcon/wdec00.pdf.. 
retrieving http://www.arb.ca.gov/msprog/spillcon/wdec00.pdf (0)... 
 (2 words) 

Removing very common words... 
no words removed. 
Writing main index... 
Computing hash table ... 
Writing header ... 
Writing index entries ... 
Writing stopwords ... 
2 unique words indexed. 
Writing file index... 
Writing file list ... 
Writing file offsets ... 
Writing MetaNames ... 
Writing offsets (2)... 
1 file indexed. 
Running time: 1 second. 
Indexing done! 
---end snip---

At this point, the /tmp/gersee file created by
line 5 of pdf-filter.sh does contain the text
conversion of the PDF file, but for some reason the
STDOUT of the pdftotext filter isn't making it back
into the swish indexer.
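
For anyone trying to reproduce this, here is a rough single-pass variant of the filter that might make the comparison easier. It is only a sketch: it runs pdftotext once and uses tee so that the debug copy and STDOUT come from the same conversion, and it also records the arguments the filter was called with. The names pdf_filter, PDFTOTEXT, and DEBUG_LOG are made up for illustration and are not part of SWISH-E or xpdf. It does not by itself explain why only 2 words reach the index.

```shell
#!/bin/sh
# Debug variant of pdf-filter.sh (a sketch, not the filter shipped with SWISH-E).
# PDFTOTEXT and DEBUG_LOG are hypothetical override variables for illustration.
PDFTOTEXT="${PDFTOTEXT:-/usr/bin/pdftotext}"
DEBUG_LOG="${DEBUG_LOG:-/tmp/gersee}"

pdf_filter() {
    # Record the arguments the caller passed, so they can be inspected later.
    echo "args: $*" > "$DEBUG_LOG.args"
    # Convert once; tee writes a copy to the log and still passes the text to STDOUT.
    "$PDFTOTEXT" "$1" - 2>/dev/null | tee "$DEBUG_LOG"
}

# Run the filter only when at least one argument (the file to convert) is given.
if [ "$#" -gt 0 ]; then
    pdf_filter "$@"
fi
```

If the .args file shows something other than a readable local path in the first argument, that would at least narrow down where the text is getting lost.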

Can anyone tell me what I'm missing here?

Thanks,
Gerald
Received on Thu Sep 20 22:02:49 2001