Re: pdftotext - erroring out

From: intervolved none <intervolved(at)>
Date: Fri Oct 25 2002 - 18:43:12 GMT
I made the change that you suggested in swishspider, and in my test program it works fine.  I checked my web server (IHS/Apache): there is a file for the mime types and PDF is listed, but there seems to be a problem with it associating .pdf with application/pdf.  I will check on that later.  It does not cause me problems now, but if I install an upgrade it might break other things later.
Bill Moseley <> wrote:
At 01:20 PM 10/24/02 -0700, intervolved none wrote:


I have tested swish-e by indexing the files using -fs and -http.
The pdf is indexed fine if I use -fs. If I try to index it using
-http it is not indexed, and I get an error message saying that the
PDF file is damaged. In both methods I am indexing the same file.


Ok, then the problem is likely that the file is being written to disk in
ASCII (text) mode, so each \n is being converted to \r\n on Windows.

Do you have a URL for one of your pdf files I can try?

Now, how are you fetching the file? When I try to fetch a .pdf file with
-S http it's rejected because it's not text/*. Is it possible that your
server is not returning the correct mime type?
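One way to check what the server actually sends is to ask for the headers and look at Content-Type. The following is only a rough sketch, not code from swishspider: it uses the core HTTP::Tiny module, and the helper names media_type and is_text_type are made up for illustration. The text/* test simply mirrors the rejection described above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper: reduce a Content-Type header value to its media type,
# lowercased, with parameters such as "; charset=..." removed.
sub media_type {
    my ($header) = @_;
    return '' unless defined $header;
    my ($type) = split /;/, $header, 2;
    $type = '' unless defined $type;
    $type =~ s/^\s+|\s+$//g;
    return lc $type;
}

# Hypothetical helper: true when the header names a text/* type.
sub is_text_type {
    my ($header) = @_;
    return media_type($header) =~ m{^text/} ? 1 : 0;
}

# Only touch the network when a URL is given on the command line.
if (my $url = shift @ARGV) {
    my $res = HTTP::Tiny->new->head($url);
    die "HEAD $url failed: $res->{status} $res->{reason}\n"
        unless $res->{success};
    my $ct = $res->{headers}{'content-type'};
    printf "%s\n  Content-Type: %s\n  text/*: %s\n",
        $url,
        defined $ct ? $ct : '(none)',
        is_text_type($ct) ? 'yes' : 'no';
}
```

Run it with the URL of one of the pdf files as the only argument. If the server's mime mapping for .pdf is broken, the Content-Type printed here will be something other than application/pdf.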

E:\Program Files\SWISH-E2.2>swish-e -c c -S http -i
Parsing config file 'c'
Indexing Data Source: "HTTP-Crawler"
Indexing "http://mardy/xfig.pdf"
Returned 0
retrieving http://mardy/xfig.pdf (0)...
Returned 0
Skipping http://mardy/xfig.pdf: Wrong content type: application/pdf.
Removing very common words...
no words removed.
Writing main index...
err: No unique words indexed!


The way -S http works is it calls an external Perl program "swishspider"
and that program writes the contents to disk for swish to index. Swish
doesn't read that file in when you are using a filter, but instead passes
your filter the name of that (temp) file.

I suspect you could fix this by inserting a single line in swishspider:

binmode CONTENTS; # add this line
print CONTENTS $$content_ref;
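To see why the missing binmode would damage a PDF, here is a small sketch that can be run on any platform. It is not taken from swishspider: the sample bytes and the helper bytes_on_disk are made up for illustration, and the :crlf layer is applied explicitly to imitate what a text-mode filehandle does by default on Windows, while :raw corresponds to what a plain binmode call gives you.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for fetched PDF bytes: binary data that happens
# to contain several 0x0A (\n) bytes.
my $content = "%PDF-1.3\n\x00\x01\x0A\xFF\n%%EOF\n";

# Write $data to a temp file through the given PerlIO layer, then read the
# file back in raw mode and return the bytes that actually landed on disk.
sub bytes_on_disk {
    my ($layer, $data) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode($fh, $layer) or die "binmode failed: $!";
    print {$fh} $data;
    close($fh) or die "close failed: $!";

    open(my $in, '<:raw', $path) or die "open failed: $!";
    local $/;                      # slurp the whole file
    my $bytes = <$in>;
    close($in);
    return $bytes;
}

my $text_mode   = bytes_on_disk(':crlf', $content);  # imitates Windows text mode
my $binary_mode = bytes_on_disk(':raw',  $content);  # same effect as binmode

printf "original: %d bytes\n", length $content;
printf "text:     %d bytes\n", length $text_mode;
printf "binary:   %d bytes\n", length $binary_mode;
print  "binary copy is ", ($binary_mode eq $content ? "identical" : "different"), "\n";
```

The text-mode copy grows by one byte for every \n in the data, which shifts the byte offsets a PDF reader relies on. That would fit the "file is damaged" message seen when the file comes through -S http.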

But I'm still not sure how the file is actually getting to swish and not
being rejected as the wrong content type.


Bill Moseley

