Freezing up on PDFs...

From: Anthony Baratta <anthony(at)not-real.2plus2partners.com>
Date: Fri Aug 06 2004 - 22:14:13 GMT
This is a resend. I'm curious whether anyone here is using Swish-e in a
Windows environment and has seen problems indexing the content of PDFs
as described below. If more information about my setup is needed,
please let me know.

-------

I've been struggling with using swish-e on a Windows 2000 server. I'm
spidering the target site, and when I hit a PDF file with "errors" (Missing
'endstream') the spider can lock up.

I've replaced the pdftotext program with the latest version (v3 1/22/2004)
and tested it on the problematic PDFs. It throws the same errors but does
create a "text" file containing all the text along with some garbage
characters. It appears that swish-e is either waiting for an exit code that
never comes from pdftotext or cannot handle the output with garbage
characters.
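
One rough way to tell those two possibilities apart outside of swish-e is to
run pdftotext on a single problem file with a timeout and look at what comes
back. The following is only a minimal Python sketch under stated assumptions:
it is not part of swish-e or spider.pl, the function names and the 30-second
timeout are made up for illustration, and it assumes pdftotext is on the PATH
and writes to stdout when the output file is given as "-".

```python
import subprocess
import sys


def run_converter(cmd, timeout_s=30):
    """Run an external command and report whether it exited or hung.

    Hypothetical helper: returns a dict with the exit code and captured
    output, or hung=True if the process was still running at the timeout.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,  # the child is killed if this expires
        )
    except subprocess.TimeoutExpired:
        return {"hung": True, "returncode": None, "stdout": b"", "stderr": b""}
    return {
        "hung": False,
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


def count_control_bytes(data):
    """Count control bytes other than tab, newline, CR and form feed."""
    allowed = {0x09, 0x0A, 0x0C, 0x0D}
    return sum(1 for b in data if b < 0x20 and b not in allowed)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumption: "pdftotext <file> -" sends the extracted text to stdout.
    result = run_converter(["pdftotext", sys.argv[1], "-"])
    if result["hung"]:
        print("pdftotext did not exit within the timeout")
    else:
        print("exit code:", result["returncode"])
        print("control bytes in output:", count_control_bytes(result["stdout"]))
        print("stderr:", result["stderr"].decode("latin-1", "replace").strip())
```

If pdftotext exits promptly with a non-zero code and the output contains
control bytes, that would point at how the output is handled rather than at a
missing exit code; if it never returns, the converter itself is the likelier
culprit.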

Has anyone else seen this?

Here's some config info, if necessary:

Swish-e v2.4.2 for windows

batch file for spidering (wrapped for reading)

"C:\Program Files\SWISH-E\swish-e.exe"
	-S prog -v 3 -c
	"C:\Program Files\SWISH-E\indexes\SiteName\SiteName.config"
	-f "C:\Program Files\SWISH-E\indexes\SiteName\index.swish-e"

config file

IndexDir perl.exe
SwishProgParameters "C:\\Progra~1\\SWISH-E\\lib\\swish-e\\spider.pl" default "http://www.site.com"
ReplaceRules remove http://www.site.com

IndexContents HTML* .asp .htm .html .pdf
StoreDescription HTML* <body> 320 
Received on Fri Aug 6 15:14:29 2004