I definitely checked. I've run and re-run the spider, changing only the
use_cookies line, and it either works (indexes the PDF fine) or breaks
(as below) depending on whether that line is present.
I've tried adding another PDF, even though I know the original is fine,
and it breaks in the same way under the same condition.
I don't know what to make of this.
Thanks,
Chad
-----Original Message-----
From: swish-e@sunsite3.berkeley.edu [mailto:swish-e@sunsite3.berkeley.edu] On Behalf Of Bill Moseley
Sent: Tuesday, December 06, 2005 4:58 PM
To: Multiple recipients of list
Subject: [SWISH-E] Re: duplicate entries in DB after regex performed on URLs?
On Tue, Dec 06, 2005 at 04:30:41PM -0500, Chad Day wrote:
> http://dev.website.org/index.php?option=content&task=view&id=5 - Using HTML2 parser - (39 words)
> http://dev.website.org/index.php?option=com_content&task=view&id=4&Itemid= - Using HTML2 parser - (33 words)
> Error: May not be a PDF file (continuing anyway)
> Error (0): PDF file is damaged - attempting to reconstruct xref table...
> Error: Couldn't find trailer dictionary
> Error: Couldn't read xref table
That just looks like a broken pdf. Did you check?
> http://dev.website.org/files/Joomla Quick Start 1.0.pdf - Using HTML2 parser - (no words indexed)
> http://dev.website.org/index.php?option=com_content&task=view&id=4&Itemid=9 - Using HTML2 parser - (33 words)
>
> If I remove the use_cookies => 1, line from my spider.conf, it works
> fine and I return to having the issue of the PHPSESSIDs.
My guess is that with cookies you are indexing different files -- or
your site has some kind of problem.
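One way to test that guess is to look at the bytes the server actually
returns for the PDF URL, once with and once without the session cookie.
The sketch below is only an illustration and is not part of swish-e or
spider.pl: the helper names (looks_like_pdf, describe, fetch) and the
cookie value are hypothetical, and the PDF check is a rough heuristic
(a "%PDF-" marker near the start, "%%EOF" near the end), not a full
validation.

```python
import urllib.parse
import urllib.request
from typing import Optional


def looks_like_pdf(data: bytes) -> bool:
    """Rough heuristic: '%PDF-' near the start and '%%EOF' near the end."""
    return b"%PDF-" in data[:1024] and b"%%EOF" in data[-1024:]


def describe(data: bytes) -> str:
    """Classify a response body as 'PDF', 'HTML', or 'unknown'."""
    if looks_like_pdf(data):
        return "PDF"
    if b"<html" in data[:1024].lower():
        return "HTML"
    return "unknown"


def fetch(url: str, cookie: Optional[str] = None) -> bytes:
    """Fetch a URL, optionally sending a Cookie header (hypothetical helper)."""
    # Percent-encode spaces and similar characters, leaving URL syntax intact.
    safe_url = urllib.parse.quote(url, safe=":/?&=%")
    request = urllib.request.Request(safe_url)
    if cookie:
        request.add_header("Cookie", cookie)
    with urllib.request.urlopen(request) as response:
        return response.read()


if __name__ == "__main__":
    url = "http://dev.website.org/files/Joomla Quick Start 1.0.pdf"
    # The cookie value is a placeholder; substitute a real session cookie.
    print("without cookie:", describe(fetch(url)))
    print("with cookie:   ", describe(fetch(url, cookie="PHPSESSID=placeholder")))
```

If the request with the cookie comes back as HTML while the one without
it comes back as a PDF, the site is probably serving a different
document (a login or error page, say) to cookie-bearing clients, which
would fit the "May not be a PDF file" error above.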
--
Bill Moseley
moseley@hank.org
Received on Wed Dec 7 06:44:42 2005