Re: spidering PDF files

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Sep 21 2004 - 21:30:42 GMT
On Tue, Sep 21, 2004 at 02:11:13PM -0700, Richard Morin wrote:
> I have wandered through several Swish-e documents, trying to
> figure out how to spider PDF files.  AFAICT, the current plan
> involves adding a line to spider.config:
> 
>    filter_content  => \&filter_content,

That works as long as &filter_content does something.

Here's the executive overview:

1) You need the Xpdf package, which provides the programs (such as
pdftotext) for converting PDF to text.

2) You need to follow the SwishSpiderConfig (or whatever it's called)
example config, which shows how to call the SWISH::Filter module.
That's the filter_content() sub you mentioned above.

I think filtering is also covered in the INSTALL doc and in the
filtering section of SWISH-CONFIG.
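
To make the two steps concrete, here is a rough sketch of a filter_content() sub for spider.config. It is not the shipped example: instead of going through the filter module (whose API isn't shown in this thread), it calls Xpdf's pdftotext directly. The argument list ($uri, $server, $response, and a reference to the content) and the "return false to skip the URL" behaviour are assumptions about how spider.pl invokes the callback, so check them against the spider docs before relying on this.

```perl
# Sketch of a spider.config callback; not the shipped SwishSpiderConfig example.
use File::Temp qw(tempfile);

# Assumption: spider.pl calls this as
#   filter_content( $uri, $server, $response, $content_ref )
# where $response behaves like an HTTP::Response and $content_ref is a
# reference to the fetched document body.
sub filter_content {
    my ( $uri, $server, $response, $content_ref ) = @_;

    # Leave anything that is not a PDF untouched.
    my $type = $response->content_type || '';
    return 1 unless $type eq 'application/pdf';

    # pdftotext reads from a file, so write the PDF to a temp file first.
    my ( $fh, $path ) = tempfile( SUFFIX => '.pdf', UNLINK => 1 );
    binmode $fh;
    print {$fh} $$content_ref;
    close $fh or return 0;

    # "-" as the output name sends the extracted text to stdout.
    open my $pipe, '-|', 'pdftotext', $path, '-' or return 0;
    my $text = do { local $/; <$pipe> };
    close $pipe;

    # Assumption: a false return tells the spider to skip this URL.
    return 0 unless defined $text && length $text;

    # Hand the plain text back in place of the PDF bytes.
    $$content_ref = $text;
    # Assumption: the spider may need more than this to pick a parser.
    $response->content_type('text/plain');
    return 1;
}
```

The sub is hooked up with the `filter_content => \&filter_content,` line quoted above. The filter module route is still the better choice if you need more than PDF, since it is the approach the example config demonstrates.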

-- 
Bill Moseley
moseley@hank.org

Received on Tue Sep 21 14:31:05 2004