Re: Indexing PDFs on Windows - Revisited....

From: Anthony Baratta <Anthony(at)not-real.2plus2partners.com>
Date: Fri Sep 24 2004 - 20:39:05 GMT
Bill Moseley wrote:
> 
> So, if you are getting that then maybe your version of the filter
> is not looking at the correct content type?
> 
> Did you try running
> 
>    swish-filter-test -verbose http://local.dev.port.com/pdf/real_ccr.pdf
>
> Interesting.  Filters can get disabled if they abort (by calling die).
> In Filter.pm it does this:
>
> That traps an exception in the individual filter.  Are you seeing that
> warning?  It would give an error message.  And then after that point
> the filter would not be used.
>
> If that's what is happening then that error message would be very
> helpful.

I've run the indexer twice (with the max file size bumped down to 1MB), 
and both times the failure message appeared in the same spot.


http://local.dev.port.com/portnyou/agendas/publ040826.asp
  - Using HTML2 parser -
  -Skipped http://local.dev.port.com/pdf/audi_shee_040722.pdf
   due to 'filter_content' user supplied function #1 death
   'Skipping http://local.dev.port.com/pdf/audi_shee_040722.pdf
   due to content type: application/pdf may be binary'
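
For anyone who wants to compare where the skips begin across runs, here is a minimal sketch in Python that pulls the "Skipped" entries out of saved spider output. It assumes the log looks like the excerpt above (a "-Skipped <url>" line followed by a "due to ..." line). The function name find_skipped and the idea of saving the output to a file are my own assumptions, not part of swish-e or spider.pl.

```python
import re

# Pattern assumed from the log excerpt above: a line of the form
# "-Skipped <url>", followed by a continuation line starting with "due to".
SKIP_RE = re.compile(r"^\s*-\s*Skipped\s+(\S+)\s*$")


def find_skipped(log_text):
    """Return (url, reason) pairs for each 'Skipped' entry, in log order.

    find_skipped is a hypothetical helper, not part of swish-e.
    """
    results = []
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        match = SKIP_RE.match(line)
        if not match:
            continue
        # The reason is taken from the line right after the "Skipped" line.
        reason = lines[i + 1].strip() if i + 1 < len(lines) else ""
        results.append((match.group(1), reason))
    return results
```

Reading the saved output of each run and comparing the first entry returned by find_skipped() should show whether the first "death" really lands on the same URL every time.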

I've run swish-filter-test against the first PDF to fail with "death" 
and against the PDF that was filtered just before the first failure. 
Both filtered successfully.

I did not find any error messages regarding the filter being disabled.

You can test the above PDF by using test.portofoakland.com instead of 
local.dev.port.com.

P.S. I'm still unable to get descriptions to work for non-PDF pages. 
I've spidered the site with PDF filtering turned off via the test_url 
option, and the descriptions still don't appear. There must be 
something weird about our HTML pages that is tripping up the indexer.

e.g.

bat file:

"C:\Program Files\SWISH-E\swish-e.exe"
    -S prog -v 3
    -c "C:\Program Files\SWISH-E\indexes\Port\port.config"
    -f "C:\Program Files\SWISH-E\indexes\Port\index.swish-e"

port.config:

DefaultContents HTML*
StoreDescription HTML* <body> 320

IndexDir perl.exe
TmpDir "C:\\Progra~1\\SWISH-E\\indexes\\Tmp\\"
SwishProgParameters
     "C:\\Progra~1\\SWISH-E\\lib\\swish-e\\spider.pl"
     default "http://local.dev.port.com"
ReplaceRules remove http://local.dev.port.com


You can run this against test.portofoakland.com after dumbing down 
the test_url to skip PDFs, then run a search against the created index 
file. I still get no descriptions.
Received on Fri Sep 24 13:39:23 2004