Re: Spidering PDF's with Swish

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Feb 06 2002 - 21:54:08 GMT
At 01:33 PM 02/06/02 -0800, AHatton@oxfam.org.uk wrote:
>Whilst using swish-e -S http.. etc works fine for indexing other content
>we can't get it to index PDF files.

>We are using version swish-e 2.0

I'd strongly recommend upgrading to the -dev version.

I'd also strongly recommend using -S prog with spider.pl when using the
-dev version, but that's a minor issue here.


>#!/bin/sh
>#/usr/X11R6/bin/pdftotext -q $1 -
>#/usr/X11R6/bin/pdftotext "$1" - 2>/dev/null
>/usr/X11R6/bin/pdftotext "$1" -

You shouldn't need a shell script.  You should be able to call pdftotext
directly from the FileFilter command.

See: 
http://swish-e.org/2.2/docs/SWISH-CONFIG.html#Document_Filter_Directives

  FileFilter .pdf       pdftotext   "'%p' -"
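
To make the substitution concrete, here is a rough sketch of what that directive amounts to. It assumes swish-e replaces %p with the path of the file to filter (a temporary file when spidering with -S http), as the linked docs describe. The expand_filter function and the /tmp/doc.pdf path are made up for illustration; they are not part of swish-e.

```shell
#!/bin/sh
# Hypothetical helper (not part of swish-e): mimics how the FileFilter
# options string "'%p' -" might be turned into a command line.
expand_filter() {
    prog=$1   # filter program, e.g. pdftotext
    opts=$2   # options string from the FileFilter line
    path=$3   # path of the file to filter (assumed to contain no | or &)
    # Replace every %p in the options string with the path.
    expanded=$(printf '%s' "$opts" | sed "s|%p|$path|g")
    printf '%s %s\n' "$prog" "$expanded"
}

# Print the command line the filter would effectively run.
expand_filter pdftotext "'%p' -" /tmp/doc.pdf
```

The single quotes around %p keep a path with spaces together as one argument, and the trailing "-" tells pdftotext to write the text to stdout, which is where swish-e reads the filtered document from.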

Here it is working with the current -dev version:

> cat c
FileFilter .pdf       pdftotext   "'%p' -"
Delay 0

> ./swish-e -c c -S http -i http://www.sanface.com/epdtest.pdf -T indexed_words

Indexing Data Source: "HTTP-Crawler"
Indexing "http://www.sanface.com/epdtest.pdf"
    Adding:[1:swishdefault(1)]   'test'   Pos:1  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'with'   Pos:2  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'the'   Pos:3  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'tiger'   Pos:4  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'epd'   Pos:5  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'converted'   Pos:6  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'from'   Pos:7  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'the'   Pos:8  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'standard'   Pos:9  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'postscript'   Pos:10  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'tiger'   Pos:11  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'by'   Pos:12  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'pstoepd'   Pos:13  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   'converter'   Pos:14  Stuct:0x1 ( FILE )
    Adding:[1:swishdefault(1)]   '1'   Pos:15  Stuct:0x1 ( FILE )
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 13 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
13 unique words indexed.
4 properties sorted.
1 file indexed.  109922 total bytes.  15 total words.
Elapsed time: 00:00:03 CPU time: 00:00:00
Indexing done!

-- 
Bill Moseley
mailto:moseley@hank.org
Received on Wed Feb 6 21:56:59 2002