Re: index pdf files with spider.pl

From: Erik Lyons <ELyons(at)not-real.mail.open.org>
Date: Wed Jul 23 2003 - 18:22:15 GMT
OK, very good. I had neglected File::Temp, which pdf2html requires.
I'm seeing PDFs in results now, but still titles of (NULL) on all docs.


>>> <moseley@hank.org> 07/23/03 09:26AM >>>
On Wed, Jul 23, 2003 at 08:25:16AM -0700, Erik Lyons wrote:
> 
> PDF transformed: 1  (1.0/sec)
>         Skipped: 1  (1.0/sec)
>     Unique URLs: 1  (1.0/sec)
> 
> # file test.html
> test.html: empty

Sorry, I'm about to leave for the day so I can't help one step at a 
time.

So it looks like the PDF was transformed but it was skipped.  perldoc
spider.pl should explain somewhat how to turn on debugging flags to see
why it's being skipped.  Hopefully, that will make it clear why you are
not getting the results you are expecting.

You should have all the tools you need to debug -- enable debugging in
spider.pl to see why you are not getting output.  And once that's fixed
you can pipe that output file into swish and use -T debugging options
with swish-e to verify what's being indexed by swish.

If that doesn't work, post a URL of the PDF in question and your 
spider.pl config file and I'll take a look tomorrow.
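
The first check described in this thread (is test.html empty, and does it
have a title?) can be sketched as a small shell function. This is only a
rough sketch: check_spider_output is a hypothetical helper name, not part of
swish-e or spider.pl, and it assumes the spider output was redirected to a
file as in the commands quoted below.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with swish-e or spider.pl).
# Reports whether a captured spider output file is empty, and if not,
# whether it contains an HTML <title> element.
check_spider_output() {
    file="$1"
    # -s is true only when the file exists and has a size greater than zero
    if [ ! -s "$file" ]; then
        echo "empty"
        return 1
    fi
    # Case-insensitive match on the start of a title tag
    if grep -qi '<title' "$file"; then
        echo "has-title"
        return 0
    fi
    echo "no-title"
    return 2
}
```

After redirecting the spider's output to test.html, calling
check_spider_output test.html separates the two symptoms: "empty" points at
the spider skipping the document, while "no-title" points at the PDF
conversion not producing a title.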

> 
> >>> <moseley@hank.org> 07/23/03 07:58AM >>>
> On Wed, Jul 23, 2003 at 07:54:51AM -0700, Erik Lyons wrote:
> > Thanks Bill,
> > 
> > Run this way, spider.pl appears to expect perl, so given the "f.conf"
> > example (list of directives) it fails in a bountiful blossom of syntax
> > errors.
> 
> Right, sorry I wasn't clear:
> 
> >    spider.pl your_config_file.name > test.html
> 
> should be:
> 
>      spider.pl your_SPIDER_config_file.name > test.html
> 
> 
> 
> > 
> > >>> Bill Moseley <moseley@hank.org> 07/22/03 07:07PM >>>
> > On Tue, Jul 22, 2003 at 04:38:13PM -0700, Erik Lyons wrote:
> > > After several weeks of exclaiming joyful praise to the initial "S" in
> > > SWISH, I stumbled across the example quoted below. It runs and reports
> > > "PDF transformed:      2,009  (19.7/sec)", but no PDF files can be
> > > returned in any search results. As an added bonus, all document titles
> > > that are in the search results appear as "(NULL)". Are these problems
> > > related, or do I have 2 different gleaming horizons of delight to
> > > explore?
> > 
> > Hard to say, but probably not hard to debug.
> > 
> > Edit the spider's config file to point to a single PDF file.  Then just
> > run the spider like:
> > 
> >    spider.pl your_config_file.name > test.html
> > 
> > and look at test.html and make sure it has a title and content.
> > 
> > Then you can index that one PDF with:
> > 
> >    cat test.html | swish-e -c your_config -S prog -i stdin -T properties
> > 
> > the -T properties will show you if the title is being stored.
> > 
> > 
> > 
> > 
> > -- 
> > Bill Moseley
> > moseley@hank.org 
> > 
> 
> -- 
> Bill Moseley
> moseley@hank.org 
> 

-- 
Bill Moseley
moseley@hank.org 
Received on Wed Jul 23 18:22:33 2003