
Re: Indexing PDFs on Windows - Revisited....

From: Anthony Baratta <Anthony(at)>
Date: Fri Sep 24 2004 - 22:32:50 GMT
Bill Moseley wrote:
> Can you set up a small HTML page with a few links to PDFs that I could
> spider that shows the problem?

I'm starting to think this might be a buffer overflow caused by the number 
of PDFs being skipped for size, or by the "Missing Endstream" errors we 
were discussing earlier.

Here's a page with 56 links to PDFs. 33 should succeed, 20 should be 
skipped (if you drop the max file size down to 1 MB), and 3 should 
"fail". The last three on the list are the failures. Something about the 
previous 53 causes the filter to blow up.

I've run the spider twice, locally and against the URL above. Same output 
every time.
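For reference, the 1 MB cutoff is just the spider's per-document size limit 
in The config is plain Perl; a rough sketch of the 
relevant part of an @servers entry follows (the URL and email are 
placeholders, max_size is assumed to be the stock byte-count option, and the 
SWISH::Filter callback from the sample config is left out):

    # Sketch only - base_url/email are placeholders, not the real site.
    @servers = (
        {
            base_url => 'http://example.local/pdf-links.html',
            email    => 'swish@example.local',
            max_size => 1_000_000,   # skip anything larger than ~1 MB
        },
    );

    1;   # the sample config file ends by returning a true value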

> So that would indicate that $filter->convert is being called but it's
> not being filtered.  (Which I guess you know by now.)  You can turn on
> filter debugging by setting the environment variable FILTER_DEBUG to
> something true (like 1 or some text).

Found it. I thought I had that set up, but I typo'd it. Sigh.
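(For anyone following along: since the spider config is plain Perl, the 
switch can also be flipped inside the config file rather than in the shell. 
A minimal sketch - putting it at the top of the file is my assumption, and 
any true value works:)

    $ENV{FILTER_DEBUG} = 1;   # turn on SWISH::Filter's debug trace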

 >> Starting to process new document: application/pdf
  ++Checking filter [SWISH::Filters::Doc2txt=HASH(0x258f670)] for 

  ++ application/pdf was not filtered by 

  ++Checking filter [SWISH::Filters::Pdf2HTML=HASH(0x257df08)] for 
  (202 words)
Problems with filter 'SWISH::Filters::Pdf2HTML=HASH(0x257df08)'.  Filter 
  -> open2: IO::Pipe: Can't spawn-NOWAIT: Resource temporarily unavailable at C:\Progra~1\SWISH-E\lib\swish-e\perl/SWISH/ line 1158

Final Content type for is application/pdf
   *No filters were used
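The "Can't spawn-NOWAIT: Resource temporarily unavailable" text is Perl's 
fork emulation on Windows refusing to start another child process. Every PDF 
conversion opens an external program over a pipe, roughly like the sketch 
below. This is NOT the real SWISH::Filters::Pdf2HTML code, and the program 
name and arguments are assumptions; it only illustrates the per-document 
spawn that can exhaust process or handle slots after enough conversions, 
particularly if children are never reaped:

    use strict;
    use warnings;
    use IPC::Open2;

    my $pdf_path = shift @ARGV || 'sample.pdf';   # placeholder input file

    # One external conversion per document, text piped back over stdout.
    my ($child_out, $child_in);
    my $pid = open2($child_out, $child_in, 'pdftotext', $pdf_path, '-');

    close $child_in;                              # nothing to send on stdin
    my $text = do { local $/; <$child_out> };     # slurp the converted text
    $text = '' unless defined $text;
    close $child_out;

    waitpid $pid, 0;   # reap the child; skipping this leaks process slots

    print length($text), " bytes of text extracted\n";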

>>P.S. I'm still unable to get the Descriptions to work for non-PDF pages. 
>>I've spidered the site with PDF filtering off via the test_url option 
>>and I can't get the descriptions to appear. There must be something 
>>weird about our HTML pages that messes up the indexer.
> Maybe.  Again, make a tiny simple HTML page and spider it and see if
> it works.  If so, then you know it's not your config.  Then try one of
> your HTML pages and see what happens.  If nothing, then turn on
> ParserWarnLevel 9 in the swish config file and/or validate the page's
> HTML.

OK - I figured out what is going on. The "Local Time - ..." that is 
appearing in the description *is* being harvested from the HTML. I 
grabbed a view-source copy of one of our pages (since 90% of the content 
comes out of a database for many of our hub pages) and ran swish-e 
against this now-flat HTML file.

The Local Time is embedded as the first real text in the page via a time 
function. But strangely, none of the other text on the page shows up. 
Just a lot of "space" filling up the rest of the description buffer. 
Well, looks like we'll have to wrap the Local Time in those special 
comments <!-- noindex --> <!-- index --> to keep this "text" from being 
indexed.
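Concretely, the change amounts to bracketing the generated timestamp, 
something like this (only the noindex/index comment pair is the real 
mechanism; the surrounding markup and the timestamp format are 
placeholders):

    <!-- noindex -->
    <p>Local Time - 3:32 PM</p>
    <!-- index -->
    <p>The rest of the page stays indexable and can feed the description.</p>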

I've done a bit of testing with this route and it appears promising; I 
just need to set the comments up so they don't interfere with the link 
spidering.
Received on Fri Sep 24 15:33:23 2004