Re: duplicate entries in DB after regex performed on URLs?

From: Chad Day <CDay(at)>
Date: Wed Dec 07 2005 - 14:44:22 GMT
I definitely checked.  I've run and re-run the search, changing only the
use_cookies line, and it either works (indexes the PDF fine) or breaks
(as below) depending on the existence of that line.

I've tried adding another PDF, even though I know the original is fine,
and it breaks as well depending on the case above.

What to make of this, I don't know.


-----Original Message-----
[] On Behalf Of Bill Moseley
Sent: Tuesday, December 06, 2005 4:58 PM
To: Multiple recipients of list
Subject: [SWISH-E] Re: duplicate entries in DB after regex performed on URLs?

On Tue, Dec 06, 2005 at 04:30:41PM -0500, Chad Day wrote:
> - Using
> HTML2 parser -  (39 words)
> d= - Using HTML2 parser -  (33 words)
> Error: May not be a PDF file (continuing anyway)
> Error (0): PDF file is damaged - attempting to reconstruct xref
> Error: Couldn't find trailer dictionary
> Error: Couldn't read xref table

That just looks like a broken pdf.  Did you check?

> Quick Start 1.0.pdf - Using HTML2
> parser -  (no words indexed)
> d=9 - Using HTML2 parser -  (33 words)
> If I remove the use_cookies => 1, line from my spider.conf, it works
> fine and I return to having the issue of the PHPSESSIDs. 

My guess is that with cookies you are indexing different files -- or
your site has some kind of problem.
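
One way to test that guess outside of the spider is to request the same
PDF URL with and without a cookie jar and look at what actually comes
back. The following is only a rough Python sketch, not part of swish-e
or spider.pl; the names make_opener, fetch and describe_body are made up
for illustration, and the URLs you pass in are your own.

```python
import http.cookiejar
import urllib.request


def make_opener(use_cookies):
    """Build a URL opener, optionally with an in-memory cookie jar."""
    handlers = []
    if use_cookies:
        jar = http.cookiejar.CookieJar()
        handlers.append(urllib.request.HTTPCookieProcessor(jar))
    return urllib.request.build_opener(*handlers)


def fetch(opener, url):
    """Return (Content-Type header, body bytes) for url."""
    with opener.open(url, timeout=30) as resp:
        return resp.headers.get("Content-Type", ""), resp.read()


def describe_body(data):
    """Classify a response body by its leading bytes.

    A PDF normally carries a "%PDF-" header near the start of the file;
    an HTML page served in its place will not.
    """
    head = data[:1024]
    if b"%PDF-" in head:
        return "pdf"
    lowered = head.lower()
    if b"<html" in lowered or b"<!doctype" in lowered:
        return "html"
    return "unknown"
```

To use it, build one opener with use_cookies=True and one with
use_cookies=False, fetch a normal page first with each (so the cookie
opener picks up the session cookie), then fetch the PDF URL with both
and compare describe_body() on the two results. If the cookie run
reports "html" or "unknown" for a URL ending in .pdf, the server is
sending something other than the PDF in that case.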

Bill Moseley

Received on Wed Dec 7 06:44:42 2005