Re: Indexing PDFs on Windows - Revisited....

From: Anthony Baratta <Anthony(at)not-real.2plus2partners.com>
Date: Fri Sep 24 2004 - 22:32:50 GMT

Bill Moseley wrote:
> 
> Can you set up a small HTML page with a few links to PDFs that I
> could spider, to show the problem?

I'm starting to think this might be a buffer overflow caused by the
number of PDFs being skipped for size, or by the "Error: Missing
Endstream" errors we were discussing earlier.

Here's a page with 56 links to PDFs. 33 should succeed, 20 should be
skipped (if you drop the max file size down to 1 MB), and 3 should
"fail". The last three on the list are the failures. Something about
the previous 53 causes the filter to blow up.

	http://test.portofoakland.com/PDF_TestPage.html

I've run the spider twice, both locally and against the URL above.
Same output every time.
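
For reference, a SwishSpiderConfig.pl entry along these lines should
reproduce the run - a sketch only, assuming the stock spider.pl
options (max_size is the knob for the 1 MB cutoff; the email address
is a placeholder, not a real config):

	# SwishSpiderConfig.pl - sketch, not the exact config
	@servers = (
	    {
	        base_url => 'http://test.portofoakland.com/PDF_TestPage.html',
	        email    => 'admin@example.com',  # placeholder contact
	        max_size => 1_000_000,            # skip anything over ~1 MB
	    },
	);
	1;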

> So that would indicate that $filter->convert is being called but the
> document is not being filtered.  (Which I guess you know by now.)
> You can turn on filter debugging by setting the environment variable
> FILTER_DEBUG to something true (like 1 or some text).

Found it. I thought I had that set up but I typo'd it. Sigh.
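
The variable just has to be in the spider's environment before the
filters load. On Windows, "set FILTER_DEBUG=1" in the cmd shell before
running does it; setting it at the top of SwishSpiderConfig.pl is
another way (a sketch - the config file is plain Perl, so this runs
when spider.pl loads it):

	# Force SWISH::Filter debug output on
	$ENV{FILTER_DEBUG} = 1;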

 >> Starting to process new document: application/pdf
  ++Checking filter [SWISH::Filters::Doc2txt=HASH(0x258f670)] for application/pdf
  ++ application/pdf was not filtered by SWISH::Filters::Doc2txt=HASH(0x258f670)
  ++Checking filter [SWISH::Filters::Pdf2HTML=HASH(0x257df08)] for application/pdf
  (202 words)
Problems with filter 'SWISH::Filters::Pdf2HTML=HASH(0x257df08)'.  Filter disabled:
  -> open2: IO::Pipe: Can't spawn-NOWAIT: Resource temporarily unavailable at C:\Progra~1\SWISH-E\lib\swish-e\perl/SWISH/Filter.pm line 1158

Final Content type for http://local.dev.port.com/pdf/audi_shee_040722.pdf is application/pdf
   *No filters were used
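
That "Can't spawn-NOWAIT: Resource temporarily unavailable" reads like
the process is running out of spawn resources after many PDFs rather
than choking on one bad file. The error points at an open2 call; its
general shape is below - a sketch only, with pdftotext and the helper
name as my assumptions, not Filter.pm's actual code. The thing worth
checking is whether every spawned child gets reaped with waitpid,
since unreaped children can pile up and eventually make spawn fail
this way:

	# Sketch of the open2 pattern behind the error above
	use strict;
	use warnings;
	use IPC::Open2;

	sub pdf_to_text {
	    my ($pdf_path) = @_;
	    my ( $rdr, $wtr );
	    my $pid = open2( $rdr, $wtr, 'pdftotext', $pdf_path, '-' );
	    close $wtr;                          # nothing to send the child
	    my $text = do { local $/; <$rdr> };  # slurp the converted text
	    close $rdr;
	    waitpid( $pid, 0 );                  # reap the child process
	    return $text;
	}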


>> P.S. I'm still unable to get the Descriptions to work for non-PDF
>> pages. I've spidered the site with PDF filtering off via the
>> test_url option and I can't get the descriptions to appear. There
>> must be something weird about our HTML pages that messes up the
>> indexer.
> 
> 
> Maybe.  Again, make a tiny simple HTML page and spider it and see if
> it works.  If so, then you know it's not your config.  Then try one
> of your HTML pages and see what happens.  If nothing, then turn on
> ParserWarnLevel 9 in the swish config file and/or validate the
> page's HTML.

OK - I figured out what is going on. The "Local Time - ..." that is
appearing in the description *is* being harvested from the HTML. I
grabbed a view-source copy of one of our pages (since 90% of the
content comes out of a database for many of our hub pages) and ran
swish-e against this now-flat HTML file.

The Local Time is embedded as the first real text in the page via a
time function. But strangely, none of the other text on the page shows
up - just a lot of "space" filling up the rest of the description
buffer. Looks like we'll have to wrap the Local Time in those special
comments <!-- noindex --> <!-- index --> to keep this "text" out of
the index.
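
Something like this is what I have in mind - a sketch, assuming the
header is emitted by a Perl handler (the function name is made up; our
real pages come out of the database layer):

	# Emit the Local Time line wrapped in SWISH-E's magic comments
	# so the indexer skips it but browsers still render it.
	sub print_local_time_header {
	    print "<!-- noindex -->\n";
	    print "Local Time - " . localtime() . "\n";
	    print "<!-- index -->\n";
	}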

I've done a bit of testing with this route and it looks promising; I
just need to set the comments up so they don't interfere with the link
spidering.
Received on Fri Sep 24 15:33:23 2004