Liam Buchanan wrote on 2/12/08 5:43 PM:
> Hi,
>
> I am using spider.pl to crawl. I have only 1 pdf on the entire intranet
> as a test. I have tried both the domain and ip in the hyperlink.
> I did some extensive testing yesterday. The strange thing is if I use
> pdftotext or pdftohtml directly on a local file then it generates the
> output correctly.
> It seems to have a big problem opening the pdf after running swish-e.
> This same pdf can be opened directly from a browser (as a binary file)
> and, as stated before, it opens when directly applying pdftotext and
> pdftohtml in cmd.
> Here's the pdftohtml error:
>
> (523 words)
> http://*****.au/dsdweb/v4/apps/web/secure/docs/25.pdf - Using HTML2 parser - Error: Couldn't open file ''http://*****.au/dsdweb/v4/apps/web/secure/docs/25.pdf''
> (no words indexed)
>
> Also, I am not sure how to turn on the -T debugging. Can you assist me
> with this?
> Verbose is active.
>
Here's a brief example of how I test:
% cat spider.conf
my ($filter_sub, $response_sub) = swish_filter();
@servers = ({
    skip     => 0,  # Flag to disable spidering this host.
    base_url => 'http://peknet.com/~karpet/swish-e_documentation.pdf',
    agent    => 'swish-e spider http://swish-e.org/',
    email    => 'swish@domain.invalid',
    # This will generate A LOT of debugging information to STDOUT
    debug    => DEBUG_URL | DEBUG_SKIPPED | DEBUG_HEADERS,
    # Here are hooks to callback routines to validate urls and responses.
    # Probably a good idea to use them so you don't try to index
    # binary data. Look at content-type headers!
    test_url       => \&test_url,
    test_response  => $response_sub,
    filter_content => $filter_sub,
});
sub test_url {
    my ( $uri, $server ) = @_;
    return 1;  # Ok to index/spider
}
1;
Now run spider.pl:
% spider.pl spider.conf | swish-e -S prog -i stdin
spider.pl: Reading parameters from 'spider.conf'
-- Starting to spider: http://peknet.com/~karpet/swish-e_documentation.pdf --
Indexing Data Source: "External-Program"
Indexing "stdin"
vvvvvvvvvvvvvvvv HEADERS for http://peknet.com/~karpet/swish-e_documentation.pdf vvvvvvvvvvvvvvvvvvvvv
---- Request ------
GET http://peknet.com/~karpet/swish-e_documentation.pdf
Accept-Encoding: gzip; deflate
From: swish@domain.invalid
User-Agent: swish-e spider http://swish-e.org/
---- Response ---
Status: 200 OK
Connection: close
Date: Wed, 13 Feb 2008 02:28:19 GMT
Accept-Ranges: bytes
ETag: "cecd0-c6835-266e1740"
Server: Apache/2.0.54 (Fedora)
Content-Length: 813109
Content-Type: application/pdf
Last-Modified: Thu, 01 Dec 2005 19:06:29 GMT
Client-Date: Wed, 13 Feb 2008 02:28:24 GMT
Client-Peer: 209.98.116.241:80
Client-Response-Num: 1
^^^^^^^^^^^^^^^ END HEADERS ^^^^^^^^^^^^^^^^^^^^^^^^^^
>> +Fetched 0 Cnt: 1 GET http://peknet.com/~karpet/swish-e_documentation.pdf 200 OK application/pdf 813109 parent: depth:0
Summary for: http://peknet.com/~karpet/swish-e_documentation.pdf
Connection: Close: 1 (0.1/sec)
Total Bytes: 486,739 (54082.1/sec)
Total Docs: 1 (0.1/sec)
Unique URLs: 1 (0.1/sec)
application/pdf->text/html: 1 (0.1/sec)
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 3,835 words alphabetically
Writing header ...
Writing index entries ...
Writing word text: Complete
Writing word hash: Complete
Writing word data: Complete
3,835 unique words indexed.
4 properties sorted.
1 file indexed. 486,739 total bytes. 79,924 total words.
Elapsed time: 00:00:12 CPU time: 00:00:03
Indexing done!
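
The test_url callback in the spider.conf above accepts every URL. If you later want the spider to skip some URLs before fetching them, that callback is the place to do it. Here is a minimal sketch, assuming $uri arrives as a URI object (so ->path is available); the list of extensions is only an illustration and is not something I used in the test above:

```perl
# Sketch only: a stricter drop-in for the test_url sub in spider.conf.
# The extension list below is an illustrative assumption.
sub test_url {
    my ( $uri, $server ) = @_;

    # $uri is assumed to be a URI object, so ->path returns the URL path.
    return 0 if $uri->path =~ /\.(?:gif|jpe?g|png)$/i;    # skip images

    return 1;    # Ok to index/spider
}
```

Returning a false value tells the spider to skip the URL, and a true value lets it through, same as the always-true version above. PDFs still pass this check, so the filter step handles them as before.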
--
Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users
Received on Tue Feb 12 21:30:18 2008