Re: Spidering "file://" PDFs

From: Bill Moseley <moseley(at)>
Date: Tue Jul 05 2005 - 18:01:42 GMT
On Tue, Jul 05, 2005 at 07:42:57AM -0700, McQuiggin, Kevin wrote:
> My error in writing the URL, I have the correct syntax in the links!

Was the error in your email or in the file you are indexing?  It couldn't be in
your email, since you would have followed these instructions carefully:

and only cut-n-pasted your examples. ;)

So, what's the question?  How to use file:// URLs?

moseley@bumby:~/apache$ cat index.html
<a href="file:///home/moseley/apache/test.pdf">testpdf</html>

moseley@bumby:~/apache$ SPIDER_DEBUG=url,failed /usr/local/lib/swish-e/ default file:///home/moseley/apache/index.html  >/dev/null
/usr/local/lib/swish-e/ Reading parameters from 'default'

 -- Starting to spider: file:///home/moseley/apache/index.html --
>> +Fetched 0 Cnt: 1 GET  file:///home/moseley/apache/index.html  200 OK text/html 126 parent: depth:0
>> +Fetched 1 Cnt: 2 GET  file:///home/moseley/apache/test.pdf  200 OK application/pdf 1636685 parent:file:///home/moseley/apache/index.html depth:1

Summary for: file:///home/moseley/apache/index.html
         Connection: Close:      1  (0.5/sec)
    Connection: Keep-Alive:      1  (0.5/sec)
               Total Bytes: 43,932  (21966.0/sec)
                Total Docs:      2  (1.0/sec)
               Unique URLs:      2  (1.0/sec)
application/pdf->text/html:      1  (0.5/sec)
                 text/html:      1  (0.5/sec)

moseley@bumby:~/apache$ FILTER_DEBUG=1 /usr/local/lib/swish-e/ default file:///home/moseley/apache/index.html  >/dev/null
>> Starting to process new document: application/pdf
 ++Checking filter [SWISH::Filters::Doc2txt=HASH(0x84e7138)] for application/pdf
 ++Checking filter [SWISH::Filters::Doc2html=HASH(0x84e65f0)] for application/pdf
 ++Checking filter [SWISH::Filters::Pdf2HTML=HASH(0x84f2e8c)] for application/pdf
 ++ application/pdf *WAS* filtered by SWISH::Filters::Pdf2HTML=HASH(0x84f2e8c)

Final Content type for file:///home/moseley/apache/test.pdf is text/html
  >Filter SWISH::Filters::Pdf2HTML=HASH(0x84f2e8c) converted from [application/pdf] to [text/html]
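
If the doubt is about the link syntax itself, here is a rough sketch of how a
local path maps to a file:// URL.  The file_url function is a hypothetical
helper of my own, not part of swish-e, and it assumes the path contains no
characters that need percent-encoding:

```shell
#!/bin/sh
# Hypothetical helper (not part of swish-e): turn an absolute path into a
# file:// URL.  Assumes no characters in the path need percent-encoding.
file_url() {
    case "$1" in
        # Empty host plus an absolute path yields three slashes in a row.
        /*) printf 'file://%s\n' "$1" ;;
        # Relative paths cannot be expressed this way, so reject them.
        *)  echo "file_url: need an absolute path: $1" >&2; return 1 ;;
    esac
}

file_url /home/moseley/apache/test.pdf
# prints: file:///home/moseley/apache/test.pdf
```

The third slash is just the leading slash of the absolute path following an
empty host name, which is the form used in the href and spider commands above.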

Bill Moseley

