OK, very good. I had neglected File::Temp, which is required by pdf2html.
I'm seeing PDFs in results now, but titles are still (NULL) on all docs.
>>> <moseley@hank.org> 07/23/03 09:26AM >>>
On Wed, Jul 23, 2003 at 08:25:16AM -0700, Erik Lyons wrote:
>
> PDF transformed: 1 (1.0/sec)
> Skipped: 1 (1.0/sec)
> Unique URLs: 1 (1.0/sec)
>
> # file test.html
> test.html: empty
Sorry, I'm about to leave for the day so I can't help one step at a
time.
So it looks like the PDF was transformed but it was skipped. perldoc
spider.pl should explain somewhat how to turn on debugging flags to see
why it's being skipped. Hopefully, that will make it clear why you are
not getting the results you are expecting.
You should have all the tools you need to debug -- enable debugging in
spider.pl to see why you are not getting output. And once that's fixed
you can pipe that output file into swish and use -T debugging options
with swish-e to verify what's being indexed by swish.
If that doesn't work, post a URL of the PDF in question and your
spider.pl config file and I'll take a look tomorrow.
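
A rough sketch of the first check described above, assuming the spider
output was saved as test.html. The helper name check_spider_output is
made up for illustration and is not part of swish-e or spider.pl; it
only reports whether the file is empty and whether it contains an HTML
title element.

```shell
# Hypothetical helper (not part of swish-e): inspect a saved spider output file.
# Prints "empty", "has title", or "no title" and returns 0 only when a title is found.
check_spider_output() {
    file=$1
    # -s is true only if the file exists and has a size greater than zero
    if [ ! -s "$file" ]; then
        echo "empty"
        return 1
    fi
    # Case-insensitive search for an opening <title tag anywhere in the output
    if grep -qi '<title' "$file"; then
        echo "has title"
    else
        echo "no title"
        return 2
    fi
}

# Assumed usage, with debugging switched on for the spider run. The
# SPIDER_DEBUG variable and its flag names are from memory; confirm them
# with "perldoc spider.pl" before relying on them:
#   SPIDER_DEBUG=url,skipped spider.pl your_SPIDER_config_file.name > test.html
#   check_spider_output test.html
```

If this prints "has title", the next step is to pipe test.html into
swish-e with -T properties to see whether the title is actually stored.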
>
> >>> <moseley@hank.org> 07/23/03 07:58AM >>>
> On Wed, Jul 23, 2003 at 07:54:51AM -0700, Erik Lyons wrote:
> > Thanks Bill,
> >
> > Run this way, spider.pl appears to expect perl, so given the "f.conf"
> > example (list of directives) it fails in a bountiful blossom of syntax
> > errors.
>
> Right, sorry I wasn't clear:
>
> > spider.pl your_config_file.name > test.html
>
> should be:
>
> spider.pl your_SPIDER_config_file.name > test.html
>
>
>
> >
> > >>> Bill Moseley <moseley@hank.org> 07/22/03 07:07PM >>>
> > On Tue, Jul 22, 2003 at 04:38:13PM -0700, Erik Lyons wrote:
> > > After several weeks of exclaiming joyful praise to the initial "S" in
> > > SWISH, I stumbled across the example quoted below. It runs and reports
> > > "PDF transformed: 2,009 (19.7/sec)", but no PDF files can be
> > > returned in any search results. As an added bonus, all document titles
> > > that are in the search results appear as "(NULL)". Are these problems
> > > related, or do I have 2 different gleaming horizons of delight to
> > > explore?
> >
> > Hard to say, but probably not hard to debug.
> >
> > Edit the spider's config file to point to a single PDF file. Then just
> > run the spider like:
> >
> > spider.pl your_config_file.name > test.html
> >
> > and look at test.html and make sure it has a title and content.
> >
> > Then you can index that one PDF with:
> >
> > cat test.html | swish-e -c your_config -S prog -i stdin -T properties
> >
> > the -T properties will show you if the title is being stored.
> >
> >
> >
> >
> > --
> > Bill Moseley
> > moseley@hank.org
> >
>
> --
> Bill Moseley
> moseley@hank.org
>
--
Bill Moseley
moseley@hank.org
Received on Wed Jul 23 18:22:33 2003