
Re: Problems indexing PDF files using HTTP crawler

From: Rosalyn Hatcher <r.s.hatcher(at)not-real.reading.ac.uk>
Date: Mon Jan 09 2006 - 11:38:27 GMT
Bill Moseley wrote:

>Try updating xpdf, perhaps.  Looks like it's not able to process that
>pdf file.  Fetch the file and then try:
>
>    pdfinfo Report05.pdf
>    pdftotext Report05.pdf -
>
>and see if those work directly.
>
>I have no problem with it:
>
>$ /usr/local/lib/swish-e/spider.pl default http://prism.enes.org/Publications/Reports/Report05.pdf > /dev/null
>/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'
>
>Summary for: http://prism.enes.org/Publications/Reports/Report05.pdf
>         Connection: Close:      1  (0.1/sec)
>               Total Bytes: 72,475  (10353.6/sec)
>                Total Docs:      1  (0.1/sec)
>               Unique URLs:      1  (0.1/sec)
>application/pdf->text/html:      1  (0.1/sec)
>
pdftotext and pdfinfo both ran fine, and I also got a successful run with the
spider command you used above with the default configuration.  Bizarre, since
I was sure I'd tried the default before and still got the error - I must have
done something wrong that time!

Consequently, I decided the problem must be in my config file, so I ditched it
and started again.  The problem line in my swish.conf was

FileFilter .pdf pdftotext "'%p' -"

Once that was removed, everything seems to work OK.  I'm not sure I understand
why this line isn't needed, as my internet searches had indicated that it was.
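My best guess at why, in case it helps anyone searching the archive later: when
the documents come in via spider.pl (the -S prog method), the spider's own
filter modules appear to convert the PDFs to HTML/text before swish-e ever sees
them - that seems to be what the "application/pdf->text/html" line in the
summary above is reporting.  A FileFilter directive tells swish-e itself to run
pdftotext on each document, so with the spider it would be applied a second
time to content that is no longer a PDF.  For reference, the config I've ended
up with looks roughly like the sketch below; the exact directive values (URL,
description size, etc.) are just illustrative for my setup, so treat it as a
sketch rather than a recipe:

    # swish.conf, run with:  swish-e -c swish.conf -S prog
    IndexDir  spider.pl
    SwishProgParameters default http://prism.enes.org/Publications/Reports/
    # spider.pl delivers already-converted documents, so parse them as HTML
    IndexContents HTML* .html .htm
    DefaultContents HTML*
    StoreDescription HTML* <body> 100000
    # Note: no FileFilter line - the spider's filters already handle PDFs

As far as I can tell, the FileFilter line would still be the right approach
when indexing .pdf files straight from the filesystem with -S fs, where
swish-e reads the raw files itself.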

Thanks for your help,
Rosalyn.

-- 
------------------------------------------------------------------------
Rosalyn Hatcher
CGAM, Dept. of Meteorology, University of Reading, 
Earley Gate, Reading. RG6 6BB
Email: r.s.hatcher@reading.ac.uk     Tel: +44 (0) 118 378 7841
Received on Mon Jan 9 03:38:32 2006