On Wed, 4 Dec 2002 Jeffrey.Grunstein@ny.frb.org wrote:
> I'm running Swish-E 2.2.1 on a Solaris 9 box. I got a filesystem index
> working flawlessly, with PDFs being parsed as TXT using pdftotext.
>
> Now, I'm trying to get it working using the prog method and spider.pl. The
> crawl seems to work fine and HTML files get indexed using the HTML2
> parser. I cannot get PDF files to index correctly. When I tried the pdf
> function internal to spider.pl, the PDF files were parsed as HTML2 and
> only between 5 and 8 words per file were indexed. I know this is wrong
> because the same PDF files with the filesystem index yield many more
> indexed words.
>
> FilterDir /opt/sfw/bin
> FileFilter .pdf pdftotext "'%p' -"

See the CHANGES entry for version 2.2.2 (November 14, 2002):
http://www.swish-e.org/current/docs/CHANGES.html#Version_2_2_2_November_14_2002
Or you can filter in the spider.pl program.
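
As a rough illustration of what filtering inside the feeding program means, here is a minimal sketch in Python rather than Perl. It is not spider.pl's own code. It assumes the -S prog input format is a short header block (Path-Name, Content-Length, Document-Type) followed by a blank line and the document body; check those header names against the documentation for your installed version. The function names pdf_to_text and format_prog_document are hypothetical, and the "TXT*" document type is an assumption.

```python
import subprocess
import sys


def pdf_to_text(pdf_path, pdftotext="pdftotext"):
    """Run `pdftotext <file> -` and return the extracted text as bytes."""
    result = subprocess.run(
        [pdftotext, pdf_path, "-"],
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout


def format_prog_document(path_name, body, doc_type="TXT*"):
    """Wrap already-filtered content in prog-style headers.

    Assumed header names: Path-Name, Content-Length, Document-Type.
    Content-Length is the size of the body in bytes.
    """
    header = (
        f"Path-Name: {path_name}\n"
        f"Content-Length: {len(body)}\n"
        f"Document-Type: {doc_type}\n"
        "\n"
    ).encode("utf-8")
    return header + body


if __name__ == "__main__":
    # Each command-line argument is treated as a local PDF file.
    for pdf_path in sys.argv[1:]:
        text = pdf_to_text(pdf_path)
        sys.stdout.buffer.write(format_prog_document(pdf_path, text))
```

The point of the sketch is that the program converts the PDF to plain text itself and labels the result as text, so the indexer never sees raw PDF data and does not try to parse it as HTML.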
--
Bill Moseley moseley@hank.org
Received on Wed Dec 4 15:41:23 2002