[swish-e] problem indexing pdf

From: Manasa Kandula <m.kandula(at)>
Date: Wed Jul 02 2008 - 10:42:39 GMT
I am currently using SWISH-E version 2.4.5.
I have used swish-e to crawl some websites with the default configuration settings.
Below is an example of the spider command and its output:
./ default > output1.txt

The output of the spider is:
Content-Length: 13996
Last-Mtime: 1214514833
Document-Type: HTML*

<meta name="creationdate" content="Fri Oct 26 12:46:25 2007">
<meta name="creator" content="Adobe LiveCycle Designer ES 8.1">
<meta name="encrypted" content="no">
<meta name="file_size" content="417680 bytes">
<meta name="moddate" content="Thu Jun 26 15:25:26 2008">
<meta name="optimized" content="yes">
<meta name="page_size" content="612 x 792 pts (letter)">
<meta name="pages" content="4">
<meta name="pdf_version" content="1.6">
<meta name="producer" content="Adobe LiveCycle Designer ES 8.1">
<meta name="tagged" content="yes">
body of the document

The PDF file on the website has been successfully converted to HTML.
But once I index the output of the spider
(swish-e -f index.swish-e -c swish.config -S prog -i stdin < output1.txt),
any document whose pathname ends with the pdf extension does not get indexed (in this example, the entire document does not get indexed).
But when I change the Pathname to Pathname: in the output1.txt file, the document gets indexed.
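
To narrow down which records are affected, it can help to list the path and Document-Type of every record in output1.txt before indexing. The following is only a rough sketch, not part of swish-e: it assumes the spider output follows the layout shown above (header lines, a blank line, then exactly Content-Length bytes of body) and that the path header is spelled Path-Name, as in the swish-e -S prog documentation. The function names and the example.invalid URLs are made up for illustration.

```python
import io


def read_records(stream):
    """Yield (headers, body) pairs from a binary stream of spider records."""
    while True:
        line = stream.readline()
        # Skip stray blank lines between records; stop at end of input.
        while line and not line.strip():
            line = stream.readline()
        if not line:
            return
        headers = {}
        # Header lines run until the first blank line.
        while line and line.strip():
            name, _, value = line.decode("latin-1").partition(":")
            headers[name.strip()] = value.strip()
            line = stream.readline()
        # The body is exactly Content-Length bytes long.
        length = int(headers.get("Content-Length", "0"))
        yield headers, stream.read(length)


def summarize(stream):
    """Return (path, document type, body size) for every record."""
    return [
        (h.get("Path-Name", "?"), h.get("Document-Type", "?"), len(body))
        for h, body in read_records(stream)
    ]


def make_record(path, doc_type, body):
    """Build one hypothetical record, only used for the demo below."""
    header = (
        f"Path-Name: {path}\n"
        f"Content-Length: {len(body)}\n"
        f"Document-Type: {doc_type}\n\n"
    )
    return header.encode("latin-1") + body


# Demo on in-memory sample data (hypothetical URLs, not real spider output).
sample = make_record("http://example.invalid/form.pdf", "HTML*", b"body of the document\n")
for path, doc_type, size in summarize(io.BytesIO(sample)):
    flag = "pdf path" if path.lower().endswith(".pdf") else ""
    print(path, doc_type, size, flag)
```

To run it against the real file, open output1.txt in binary mode and pass the file object to summarize(). Records flagged as "pdf path" are the ones to compare against whatever file-extension rules swish.config applies.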

How can I solve this problem?

Received on Wed Jul 2 06:35:54 2008