Re: problem indexing PDFs - "Error (0): PDF file is damaged"

From: <Brad_Horstkotte(at)not-real.capgroup.com>
Date: Wed Dec 17 2003 - 21:16:38 GMT
I tried your suggestion:

adding system("copy $file \temp.pdf") in _pdf2html.pl

..but it appeared to create no file - when I tried something similar in a
test .pl file:

system("copy test.pl test2.pl")

..it worked fine, so maybe something to do with the piping going on??
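One possible explanation, offered as a guess rather than a diagnosis: in a double-quoted Perl string, "\t" is a tab character, so "\temp.pdf" would reach the shell as a tab followed by "emp.pdf", and the copy may have landed in a file called emp.pdf in whatever directory the filter was running from. The sketch below shows the difference. The file name input.pdf and the helper debug_copy are made up for illustration and are not part of _pdf2html.pl.

```perl
use strict;
use warnings;
use File::Copy qw(copy);

my $file = 'input.pdf';    # hypothetical file name, for illustration only

# As written in the message: "\t" inside double quotes is a TAB,
# so the target becomes TAB + "emp.pdf", not "\temp.pdf".
my $as_written = "copy $file \temp.pdf";

# Doubling the backslash keeps it literal.
my $escaped = "copy $file \\temp.pdf";

# Make the tab visible before printing.
(my $visible = $as_written) =~ s/\t/<TAB>/g;
print "as written: $visible\n";    # copy input.pdf <TAB>emp.pdf
print "escaped:    $escaped\n";    # copy input.pdf \temp.pdf

# Hypothetical helper: copy without going through the shell at all,
# which avoids both quoting problems and shell built-ins.
sub debug_copy {
    my ($src, $dst) = @_;
    copy($src, $dst) or die "Copy of $src to $dst failed: $!";
    return $dst;
}
```

Calling something like debug_copy($file, 'C:/temp.pdf') with an absolute target path would also rule out the copy ending up in an unexpected working directory; the path here is only an example.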

Anyway, I'm still trying to figure out how to debug this with very little
Perl knowledge (lots of other programming languages, but not Perl; I guess
one of these days I'll have to read a book on it).  I saw some temporary
files that were left behind when I did a <break> during spidering.  I'm
not sure at what step of the process those were generated, but in any case
they check out fine: they are identical to the downloaded/saved PDFs, byte
for byte, and process fine when passed to _pdf2html.pl.
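For anyone repeating the byte-for-byte check from Perl, here is a minimal sketch. The function name first_difference and the file names in the usage note are invented for this example. It reads both files in binary mode, which matters on Windows because text mode translates line endings and would hide or introduce differences.

```perl
use strict;
use warnings;

# Return the byte offset of the first difference between two files,
# or -1 if they are identical. Both handles are switched to binary
# mode so line endings are not translated while reading.
sub first_difference {
    my ($path_a, $path_b) = @_;
    open my $fa, '<', $path_a or die "Cannot open $path_a: $!";
    open my $fb, '<', $path_b or die "Cannot open $path_b: $!";
    binmode $fa;
    binmode $fb;

    my $offset = 0;
    while (1) {
        my $na = read($fa, my $buf_a, 4096);
        my $nb = read($fb, my $buf_b, 4096);
        die "Read error: $!" unless defined $na && defined $nb;
        return -1 if $na == 0 && $nb == 0;    # both files ended together
        if ($buf_a ne $buf_b) {
            my $limit = $na < $nb ? $na : $nb;
            for my $i (0 .. $limit - 1) {
                return $offset + $i
                    if substr($buf_a, $i, 1) ne substr($buf_b, $i, 1);
            }
            return $offset + $limit;          # one file is a prefix of the other
        }
        $offset += $na;
    }
}
```

A call such as first_difference('saved.pdf', 'spider_temp.pdf') would then give the offset to inspect in a hex viewer, or -1 when the two files match.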

> No it's more turnkey.  If you use the "default" mode it should know how to
> decode it:...

I tried doing "spider.pl default http://localhost/ | swish-e -S prog -i
stdin", and it seemed to pick up the PDFs OK, but when I ran a search on
the resulting index, none of the results had descriptions, and the PDFs
had the PDF file name as their title instead of the meta title.  I assume
that is because of how "default" is configured.  If so, where is the
default configuration specified?

I'm also unclear on the connection between this particular command example
and the use of SWISH::Filter - perhaps because I haven't seen the "default"
configuration.

Thanks for your help in figuring this out - Brad



                                                                                                                                       
Bill Moseley <moseley@hank.org>
12/17/2003 09:47 AM

To:       Brad Horstkotte/AFS/CAPITAL@CG
cc:       Multiple recipients of list <swish-e@sunsite.berkeley.edu>
Subject:  Re: [SWISH-E] Re: problem indexing PDFs - "Error (0): PDF file is damaged"
                                                                                                                                       
                                                                                                                                       




On Wed, Dec 17, 2003 at 09:33:59AM -0800, Brad_Horstkotte@capgroup.com
wrote:

> No, the files aren't empty - I was able to download them and run the
> PDF conversion from the command line, and the resulting text file is
> fine - it's only when running the conversion as a filter while
> spidering that I get those errors - something to do with the temp file
> that is created and passed to the filter.

Do you have a way to compare the files to find out how they differ?


--
Bill Moseley
moseley@hank.org
Received on Wed Dec 17 21:16:48 2003