Re: Ignoring session ids when distinguishing files as being different

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Feb 22 2005 - 14:50:19 GMT
On Tue, Feb 22, 2005 at 03:29:44AM -0800, Stefan Seiz wrote:
> These session ids change and thus the swish-e crawler.pl will recognize
> pages as being different, although they are in fact the same pages (just
> the session-id changes).

There are two places to do that.  In your spider config file, a
test_url() callback function can strip them from links as they are
extracted from the document.  Or you can use a filter_content()
callback to modify the URL before the document is sent to swish for
indexing.

I suspect you can also use "ReplaceRules" in your swish-e config file.

> 2) Password protected PDF files.
> All our PDFs are protected with the same password, so i can easily pass a
> password to the command line options of pdftotext.
> 
> So i tried modifying
>     /usr/local/lib/swish-e/perl/SWISH/Filters/Pdf2HTML.pm
> and tried to add "-opw MyPasswd" to the call to $self->run_pdftotext but
> failed miserably. I tried many different variations of adding the -opw
> option to pdftotext.

Maybe you needed to add it to the call to pdfinfo, too.

Test from the command line first to make sure it works with pdfinfo and
pdftotext.
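
A command-line check along those lines might look like the sketch below.
It is only an illustration: the function name check_pdf_password, the
file name and the password are made up, and it assumes the password is
the owner password (-opw).  If the PDFs need a user password to open,
both tools take -upw instead.

```shell
#!/bin/sh
# Sketch: confirm the password works with both tools before editing the
# filter module. check_pdf_password is a hypothetical helper name.

check_pdf_password() {
    pdf=$1
    password=$2

    # pdfinfo has to open the file with the same password.
    if ! pdfinfo -opw "$password" "$pdf" >/dev/null; then
        echo "pdfinfo failed for $pdf" >&2
        return 1
    fi

    # pdftotext writes to stdout when the output file is "-".
    if ! pdftotext -opw "$password" "$pdf" - >/dev/null; then
        echo "pdftotext failed for $pdf" >&2
        return 1
    fi

    echo "ok: $pdf"
}
```

For example, "check_pdf_password protected.pdf MyPasswd" should print
"ok: protected.pdf" only if both tools accept the password.  If pdfinfo
fails here, that would point at the pdfinfo call rather than pdftotext.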
Received on Tue Feb 22 06:50:19 2005