Re: Indexing PDFs on Windows - Revisited....

From: Anthony Baratta <Anthony(at)not-real.2plus2partners.com>
Date: Fri Sep 24 2004 - 19:05:31 GMT
Bill Moseley wrote:
> 
> The "default" setup does use the keep_alive feature.  keep_alive, as
> I'm sure you know, allows multiple requests over the same TCP
> connection to the server -- so using keep alives saves the connection
> overhead and the time that each server process waits after closing the
> connection.

OK - I found the "map" setup for default in spider.pl. So I can tweak 
with abandon. ;-)

I think many of my PDFs are so badly malformed that pdftotext cannot 
read them correctly. When spider.pl does not lock up during the scan, 
I'm starting to see a ton of error messages like this one:

  -Skipped http://local.dev.port.com/pdf/real_ccr.pdf due to 
'filter_content' user supplied function #1 death 'Skipping
    http://local.dev.port.com/pdf/real_ccr.pdf
    due to content type: application/pdf may be binary'

I have not been able to capture when this error first occurs, but it 
appears that once it shows up, every PDF found thereafter fails to 
index with the same type of error message.
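
One way to pin down the first occurrence might be to save the spider's 
output to a file and scan it afterwards. The following is only a rough 
sketch in Python, not part of spider.pl: the function name 
first_filter_failure is made up for illustration, and the pattern 
assumes each message sits on a single line in the saved output and is 
worded like the one quoted above.

```python
import re
from collections import deque

# Pattern modelled on the message quoted above. The exact wording may
# differ between spider.pl versions, so treat it as an assumption.
SKIP_RE = re.compile(r"Skipped\s+(\S+)\s+due to\s*'filter_content'")


def first_filter_failure(lines, context=5):
    """Find the first 'filter_content' skip in an iterable of log lines.

    Returns (line_number, url, preceding_lines), or None if no such
    message is present. preceding_lines holds up to `context` lines
    that came just before the match, to show what the spider was doing.
    """
    recent = deque(maxlen=context)
    for number, line in enumerate(lines, start=1):
        match = SKIP_RE.search(line)
        if match:
            return number, match.group(1), list(recent)
        recent.append(line.rstrip("\n"))
    return None
```

Passing an open file handle for the saved output (say, a hypothetical 
spider.log) returns the line number and URL of the first skipped PDF, 
plus the few lines before it, which should show which document was 
being processed right before the failures began.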

Any clues?
Received on Fri Sep 24 12:05:52 2004