I'm running Swish-E 2.2.1 on a Solaris 9 box. I have a filesystem index
working flawlessly, with PDFs converted by pdftotext and parsed as TXT.
Now I'm trying to get it working using the prog method and spider.pl. The
crawl seems to work fine, and HTML files get indexed using the HTML2
parser, but I cannot get PDF files to index correctly. When I tried the pdf
function internal to spider.pl, the PDF files were parsed as HTML2 and only
between 5 and 8 words per file were indexed. I know this is wrong because
the same PDF files yield many more indexed words with the filesystem
index.
I also tried using pdftotext, and with that no words get indexed at all.
Here's a snippet from my swish-e config file:
IndexContents HTML2 .html .htm
StoreDescription HTML2 100000
FilterDir /opt/sfw/bin
FileFilter .pdf pdftotext "'%p' -"
IndexContents TXT .pdf
StoreDescription TXT 250000
Note that the same directives work perfectly when we do a filesystem index.
For some reason, they don't work with a prog / spider.pl crawl.
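For anyone comparing the two methods, here is a rough sketch (in Python, not spider.pl itself) of what I understand a prog-mode program hands to swish-e for each document: a few header lines, a blank line, then the content. The function name prog_record and the example URL are made up for illustration. The header names are the ones I believe the -S prog documentation describes, and whether Document-Type is honored by 2.2.1 is an assumption on my part.

```python
from typing import Optional


def prog_record(path_name: str, content: bytes,
                doc_type: str = "TXT",
                mtime: Optional[int] = None) -> bytes:
    """Build one document record: headers, a blank line, then the body.

    Hypothetical helper for illustration only. The header names are
    assumed from the swish-e -S prog documentation.
    """
    headers = [
        "Path-Name: {}".format(path_name),
        # Content-Length counts the bytes of the body that follows.
        "Content-Length: {}".format(len(content)),
    ]
    if mtime is not None:
        headers.append("Last-Mtime: {}".format(mtime))
    # Assumption: this header selects the parser (TXT, HTML2, ...).
    headers.append("Document-Type: {}".format(doc_type))
    return ("\n".join(headers) + "\n\n").encode("ascii") + content


if __name__ == "__main__":
    # In a real crawl the body would be the text produced by pdftotext.
    body = b"hello world\n"
    print(prog_record("http://example.com/a.pdf", body).decode("ascii"))
```

If the program has already converted the PDF to text, then the body above is plain text, and presumably the declared type rather than the .pdf extension should pick the parser. That may be relevant to why the PDFs ended up in HTML2.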
Received on Wed Dec 4 15:09:28 2002