[swish-e] problem indexing pdf

From: Manasa Kandula <m.kandula(at)not-real.RUG.nl>
Date: Wed Jul 02 2008 - 10:42:39 GMT
Hello,
I am currently using SWISH-E version 2.4.5.
I used the SWISH-E spider.pl to crawl some websites, with the default configuration settings.
Below is an example of the command I ran and the resulting spider output:
./spider.pl default  http://www.uscis.gov/files/form/I-9.pdf > output1.txt

The output of the spider is:
Path-Name: http://www.uscis.gov/files/form/I-9.pdf
Content-Length: 13996
Last-Mtime: 1214514833
Document-Type: HTML*

<html>
<head>
<meta name="creationdate" content="Fri Oct 26 12:46:25 2007">
<meta name="creator" content="Adobe LiveCycle Designer ES 8.1">
<meta name="encrypted" content="no">
<meta name="file_size" content="417680 bytes">
<meta name="moddate" content="Thu Jun 26 15:25:26 2008">
<meta name="optimized" content="yes">
<meta name="page_size" content="612 x 792 pts (letter)">
<meta name="pages" content="4">
<meta name="pdf_version" content="1.6">
<meta name="producer" content="Adobe LiveCycle Designer ES 8.1">
<meta name="tagged" content="yes">
</head>
<body>
<pre>
body of the document
</pre>
</body>
</html>


The PDF file from the website has been successfully converted to HTML.
But when I index the output of the spider
(swish-e -f index.swish-e -c swish.config -S prog -i stdin < output1.txt),
any record whose Path-Name ends with the .pdf extension does not get indexed (in this example, that is the entire document).
However, when I change the Path-Name in the output1.txt file to Path-Name: http://www.uscis.gov/files/form/I-9.txt, the document gets indexed.
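
To see which records in output1.txt carry a .pdf Path-Name and which Document-Type the spider assigned to them, the file can be inspected with a short script. This is only a minimal Python sketch. It assumes the usual -S prog record layout: header lines, a blank line, then exactly Content-Length bytes of content. The function name parse_prog_records is hypothetical and not part of SWISH-E.

```python
import sys


def parse_prog_records(data: bytes):
    """Split a `-S prog` style stream into (headers, body) pairs.

    Assumes each record is: header lines, one blank line, then exactly
    Content-Length bytes of content (an assumption about the input file).
    """
    records = []
    pos = 0
    while pos < len(data):
        # Skip stray newlines between records.
        while pos < len(data) and data[pos:pos + 1] in b"\r\n":
            pos += 1
        end = data.find(b"\n\n", pos)
        if end == -1:
            break  # no complete header block left
        headers = {}
        for line in data[pos:end].decode("latin-1").splitlines():
            # Split on the first colon only, so URLs in values stay intact.
            name, sep, value = line.partition(":")
            if sep:
                headers[name.strip()] = value.strip()
        body_start = end + 2
        length = int(headers.get("Content-Length", 0))
        body = data[body_start:body_start + length]
        records.append((headers, body))
        pos = body_start + length
    return records


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python inspect_prog.py output1.txt
    with open(sys.argv[1], "rb") as fh:
        for headers, body in parse_prog_records(fh.read()):
            print(headers.get("Path-Name"),
                  headers.get("Document-Type"),
                  "declared:", headers.get("Content-Length"),
                  "read:", len(body))
```

The record pasted above has its body abbreviated, so its declared Content-Length would not match that text. The sketch is meant for the real output1.txt.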

How can I solve this problem?
Thanks,
Manasa


Received on Wed Jul 2 06:35:54 2008