RE: FW: PDF indexing suddenly stopped working

From: Chad Day <CDay(at)not-real.mindshare.net>
Date: Fri Dec 02 2005 - 18:10:42 GMT
Sorry, I should have provided more detail. I was due to give a swish-e presentation in 30 minutes when this broke, hence the panicking.

It doesn't hang or anything; it just skips the PDFs when indexing via HTTP.  I tried a filesystem index and it indexed the PDFs fine.  Over HTTP it doesn't even try to convert: it recognizes the file as a PDF, reports that it's the wrong content type, and continues indexing the rest of the site.

retrieving http://dev.website.org/files/Joomla%20Quick%20Start%201.0.pdf?PHPSESSID=ccb2b389c01304486171976163f94f82 (2)...
sleeping 1 seconds before fetching http://dev.website.org/files/Joomla%20Quick%20Start%201.0.pdf?PHPSESSID=ccb2b389c01304486171976163f94f82
Now fetching [http://dev.website.org/files/Joomla%20Quick%20Start%201.0.pdf?PHPSESSID=ccb2b389c01304486171976163f94f82]...Status: 200. application/pdf
Skipping http://dev.website.org/files/Joomla%20Quick%20Start%201.0.pdf?PHPSESSID=ccb2b389c01304486171976163f94f82:  Wrong content type: application/pdf.

(and then continues on to the next file) 

retrieving http://dev.website.org/index.php?option=com_content&task=view&id=4&Itemid=9&PHPSESSID=ccb2b389c01304486171976163f94f82 (2)...
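
One thing visible in the output above is that every URL carries a ?PHPSESSID=... query string, so the full URL string does not end in .pdf even though its path does. Whether that affects how swish-e applies suffix-based settings such as FileFilter over HTTP is an assumption, not something confirmed in this thread. The short Python sketch below only demonstrates the string-level difference; `suffix_views` is a hypothetical helper name.

```python
from urllib.parse import urlsplit

def suffix_views(url, suffix=".pdf"):
    """Compare a suffix test on the raw URL with one on the URL path only."""
    path = urlsplit(url).path  # drops the ?query and #fragment parts
    return {
        "raw_url_matches": url.lower().endswith(suffix),
        "path_matches": path.lower().endswith(suffix),
    }

# URL copied from the log output in this message.
url = ("http://dev.website.org/files/Joomla%20Quick%20Start%201.0.pdf"
       "?PHPSESSID=ccb2b389c01304486171976163f94f82")
print(suffix_views(url))
```

If the two values differ, a comparison made against the raw URL would not see a .pdf suffix, which may be worth ruling out.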

I turned up this in the archives:

http://www.swish-e.org/archive/2002-02/3598.html

but it had no replies.

Can anyone suggest what to troubleshoot next?  I'm especially frustrated because this WAS working at one point, and then this morning, zip.  I had a cron job to index the site overnight, so I thought that run might have gone bad; I removed the cron job and the index file and rebuilt, but that didn't fix the issue.

Thanks,
Chad


-----Original Message-----
From: Peter Karman [mailto:peter@peknet.com] 
Sent: Friday, December 02, 2005 12:20 PM
To: Chad Day
Cc: Multiple recipients of list
Subject: Re: [SWISH-E] FW: PDF indexing suddenly stopped working

It wasn't clear to me from your email what "stopped working" means. Does
the PDF not get fetched via HTTP? Does it get fetched but doesn't get
converted? Does it get converted but not indexed? Does swish-e hang? Give
an error?

Basic troubleshooting rules apply: break the process down into its steps
and see where it fails.
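
One way to apply that to verbose indexing output is to record, for each URL, how far it got: fetched with which status and content type, and whether it was then skipped. The Python sketch below is only an illustration; the two patterns are assumptions based on the log lines quoted in this thread, not an official description of swish-e's output, and `summarize` is a hypothetical helper name.

```python
import re

# Patterns modeled on the verbose output quoted in this thread (assumed format).
STATUS = re.compile(
    r"Now fetching\s*\[(?P<url>[^\]]+)\]\.\.\.Status:\s*(?P<code>\d+)\.\s*(?P<ctype>\S+)"
)
SKIP = re.compile(r"^Skipping (?P<url>\S+):\s+(?P<reason>.+)$")

def summarize(log_text):
    """Map each URL to the fetch status, content type, and skip reason seen."""
    results = {}
    for line in log_text.splitlines():
        line = line.strip()
        m = STATUS.search(line)
        if m:
            entry = results.setdefault(m.group("url"), {})
            entry["status"] = int(m.group("code"))
            entry["content_type"] = m.group("ctype")
            continue
        m = SKIP.match(line)
        if m:
            entry = results.setdefault(m.group("url"), {})
            # Drop the trailing period that ends the log sentence.
            entry["skipped"] = m.group("reason").rstrip(".")
    return results
```

A URL that shows a 200 status and a skip reason was fetched successfully and rejected afterwards, which narrows the failure to the step between fetching and filtering.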

> Correctly formatted this time, my apologies.
>
> PDF indexing suddenly stopped working. No idea why, either. ☹
>
> From the indexing process (swish-e -c swish.conf -v 3 -S http):
>
> retrieving
> http://dev.website.org/files/Joomla%20Quick%20Start.pdf?PHPSESSID=413c04013e7c3505db9a68bedf8a8951
> (3)...
> sleeping 1 seconds before fetching
> http://dev.website.org/files/Joomla%20Quick%20Start.pdf?PHPSESSID=413c04013e7c3505db9a68bedf8a8951
> Now fetching
> [http://dev.website.org/files/Joomla%20Quick%20Start%201.0.pdf?PHPSESSID=413c04013e7c3505db9a68bedf8a8951]...Status:
> 200. application/pdf
>
> $ cat swish.conf
> # Example configuration file
>
> # Tell Swish-e what to index (same as -i switch above)
> IndexDir http://dev.website.org/index.php
> IndexFile /usr/local/apache/htdocs/website.index
> IndexOnly .php .txt .html .htm .pdf .xml .htm .shtml
>
> # Index the PDF files
> FileFilter .pdf /usr/X11R6/bin/pdftotext '"%p" -'
>
> # Tell Swish-e that .txt files are to use the text parser.
> IndexContents TXT* .txt .pdf
> IndexContents XML* .xml
> IndexContents HTML* .htm .html .shtml .php
>
> PropertyNamesMaxLength 1000 swishdescription
> PropertyNameAlias swishdescription body
>
> StoreDescription TXT* 250000
> Delay 1
>
> # Otherwise, use the HTML parser
> DefaultContents HTML*
>
> Any ideas?
Received on Fri Dec 2 10:10:49 2005