Re: [swish-e] Swish-e not indexing doc or PDF files

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Wed Feb 13 2008 - 02:29:54 GMT
Liam Buchanan wrote on 2/12/08 5:43 PM:
> Hi,
> 
> I am using spider.pl to crawl. I have only 1 pdf on the entire intranet
> as a test. I have tried both the domain and ip in the hyperlink.
> I did some extensive testing yesterday. The strange thing is if I use
> pdftotext or pdftohtml directly on a local file then it generates the
> output correctly.
> It seems to have a big problem opening the pdf after running swish-e.
> This same pdf can be opened directly from a browser (as a binary file)
> and, as stated before, it opens when applying pdftotext and
> pdftohtml directly in cmd.
> Here's the pdftohtml error:
> 
>  (523 words)
> http://*****.au/dsdweb/v4/apps/web/secure/docs/25.pdf - Using HTML2 parser - Error: Couldn't open file ''http://*****.au/dsdweb/v4/apps/web/secure/docs/25.pdf''
>  (no words indexed)
> 
> Also I am not sure how to turn on the -T debugging - can you assist me
> with this.
> Verbose is active.
> 

Here's a brief example of how I test:

% cat spider.conf
my ($filter_sub, $response_sub) = swish_filter();

@servers = ({
     skip        => 0,         # Flag to disable spidering this host.

     base_url    => 'http://peknet.com/~karpet/swish-e_documentation.pdf',

     agent       => 'swish-e spider http://swish-e.org/',
     email       => 'swish@domain.invalid',

     # This will generate A LOT of debugging information to STDOUT
     debug       => DEBUG_URL | DEBUG_SKIPPED | DEBUG_HEADERS,


     # Here are hooks to callback routines to validate urls and responses
     # Probably a good idea to use them so you don't try to index
     # Binary data.  Look at content-type headers!

     test_url        => \&test_url,
     test_response   => $response_sub,
     filter_content  => $filter_sub,
} );

sub test_url {
  my ( $uri, $server ) = @_;
  return 1;  # Ok to index/spider
}

1;



Now run spider.pl and pipe its output to swish-e:

% spider.pl spider.conf | swish-e -S prog -i stdin

spider.pl: Reading parameters from 'spider.conf'

  -- Starting to spider: http://peknet.com/~karpet/swish-e_documentation.pdf --
Indexing Data Source: "External-Program"
Indexing "stdin"

vvvvvvvvvvvvvvvv HEADERS for http://peknet.com/~karpet/swish-e_documentation.pdf vvvvvvvvvvvvvvvvvvvvv

---- Request ------
GET http://peknet.com/~karpet/swish-e_documentation.pdf
Accept-Encoding: gzip; deflate
From: swish@domain.invalid
User-Agent: swish-e spider http://swish-e.org/


---- Response ---
Status: 200 OK
Connection: close
Date: Wed, 13 Feb 2008 02:28:19 GMT
Accept-Ranges: bytes
ETag: "cecd0-c6835-266e1740"
Server: Apache/2.0.54 (Fedora)
Content-Length: 813109
Content-Type: application/pdf
Last-Modified: Thu, 01 Dec 2005 19:06:29 GMT
Client-Date: Wed, 13 Feb 2008 02:28:24 GMT
Client-Peer: 209.98.116.241:80
Client-Response-Num: 1

^^^^^^^^^^^^^^^ END HEADERS ^^^^^^^^^^^^^^^^^^^^^^^^^^

 >> +Fetched 0 Cnt: 1 GET  http://peknet.com/~karpet/swish-e_documentation.pdf 200 OK application/pdf 813109 parent: depth:0

Summary for: http://peknet.com/~karpet/swish-e_documentation.pdf
          Connection: Close:       1  (0.1/sec)
                Total Bytes: 486,739  (54082.1/sec)
                 Total Docs:       1  (0.1/sec)
                Unique URLs:       1  (0.1/sec)
application/pdf->text/html:       1  (0.1/sec)
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 3,835 words alphabetically
Writing header ...
Writing index entries ...
   Writing word text: Complete
   Writing word hash: Complete
   Writing word data: Complete
3,835 unique words indexed.
4 properties sorted.
1 file indexed.  486,739 total bytes.  79,924 total words.
Elapsed time: 00:00:12 CPU time: 00:00:03
Indexing done!
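
The headers above show Content-Type: application/pdf, and the summary line "application/pdf->text/html" suggests the filter was chosen from that type. When a PDF is not being converted, one thing worth checking is what the server actually reports for the URL. A rough sketch of such a check follows. It is not part of swish-e or spider.pl: the content_type helper name and the use of curl to fetch the headers are assumptions for illustration only.

```shell
# content_type: read HTTP response headers on stdin and print the bare
# media type (lowercased, without parameters such as "; charset=...").
# Hypothetical helper, not something shipped with swish-e.
content_type() {
  awk -F': *' '
    tolower($1) == "content-type" {
      sub(/;.*/, "", $2)      # drop any parameters after the media type
      print tolower($2)
      exit                    # only the first Content-Type header matters
    }
  ' | tr -d '\r'              # HTTP header lines end in CRLF; strip the CR
}

# Example use (assumes curl is installed; substitute your own URL):
#   curl -sI 'http://peknet.com/~karpet/swish-e_documentation.pdf' | content_type
```

If this prints something other than application/pdf for a PDF, the web server's MIME type configuration may be the place to look before the spider or the filter.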



-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Tue Feb 12 21:30:18 2008