Hi,
I am new to Swish-e, coming from HtDig.
While evaluating Swish-e, I discovered two show-stoppers for our environment.
1) Session IDs in URLs.
Our site is served dynamically, and the app server includes session IDs in
URLs, which I cannot turn off.
These session IDs change, so the Swish-e crawler.pl treats pages as
different although they are in fact the same pages (only the session ID
changes).
Using htdig, I could work around this problem with one simple configuration
option:
url_rewrite_rules: (.*)&pb-id=.* \\1
(where pb-id=XXXXX is my session ID)
Is there anything similar in Swish-e to make it ignore the session ID when
it decides whether two URLs point to different pages?
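
To make the intent concrete, here is a minimal Python sketch of the
normalization I am after. It only mirrors the htdig rule above; the function
name strip_session_id and the example.com URLs are made up for illustration,
and it says nothing about how Swish-e itself would do this.

```python
import re

# Same idea as the htdig rule: keep everything before "&pb-id=" and drop the rest.
# SESSION_RE and strip_session_id are hypothetical names for this sketch.
SESSION_RE = re.compile(r"(.*)&pb-id=.*")

def strip_session_id(url: str) -> str:
    """Return the URL with the trailing pb-id session parameter removed."""
    return SESSION_RE.sub(r"\1", url)

if __name__ == "__main__":
    a = strip_session_id("http://example.com/page?x=1&pb-id=AAA111")
    b = strip_session_id("http://example.com/page?x=1&pb-id=BBB222")
    # Both visits should collapse to the same URL once the session ID is gone.
    print(a == b, a)
```

With a rewrite like this applied before the crawler compares URLs, two visits
to the same page would no longer look like two different documents.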
2) Password-protected PDF files.
All our PDFs are protected with the same password, so I can easily pass the
password to pdftotext on the command line.
So I tried modifying
/usr/local/lib/swish-e/perl/SWISH/Filters/Pdf2HTML.pm
to add "-opw MyPasswd" to the call to $self->run_pdftotext, but failed
miserably. I tried many different variations of adding the -opw option to
pdftotext.
Can anyone tell me how I need to add the -opw option to the call to
pdftotext?
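
For reference, this is the command line I am trying to reproduce from inside
the filter, sketched in Python rather than in the filter's Perl. The function
names and doc.pdf are made up for illustration, and this is not how
Pdf2HTML.pm builds its call; it only shows the invocation that works for me
by hand, with -opw and the password as separate arguments.

```python
import subprocess

def build_pdftotext_command(pdf_path: str, owner_password: str) -> list[str]:
    """Build the argument list: pdftotext -opw <password> <file> -"""
    # "-opw" and its value are two separate arguments; the trailing "-"
    # tells pdftotext to write the extracted text to stdout.
    return ["pdftotext", "-opw", owner_password, pdf_path, "-"]

def extract_text(pdf_path: str, owner_password: str) -> str:
    """Run pdftotext (must be installed and on PATH) and return its output."""
    cmd = build_pdftotext_command(pdf_path, owner_password)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    # Only prints the command; call extract_text() to actually run it.
    print(build_pdftotext_command("doc.pdf", "MyPasswd"))
```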
Thanks!
--
Stefan Seiz <http://www.StefanSeiz.com>
Spamto: <bin@imd.net>
Received on Tue Feb 22 03:30:51 2005