Problems indexing PDF files using HTTP crawler

From: Rosalyn Hatcher <r.s.hatcher(at)not-real.reading.ac.uk>
Date: Fri Jan 06 2006 - 11:48:58 GMT
Hi,

I'm having trouble getting swish-e to index PDF files using spider.pl
and am now at a loss as to where to look next.  I've looked on the
swish-e site but have failed to find any further info that helps me
with this problem.  I'm using pdftotext to do the conversion.  I have
successfully got swish-e to index a single PDF file using the -S fs and
-S http options, but can't for the life of me figure out why it won't
work when crawling the web server with -S prog and spider.pl.  Can
anyone shed any light on what I'm possibly doing wrong?

Any help much appreciated. Thanks.
Rosalyn.

The output I get is...

prismweb@hermes> /usr/local/bin/swish-e -c swish.conf -S prog
Indexing Data Source: "External-Program"
Indexing "/usr/local/lib/swish-e/spider.pl"
External Program found: /usr/local/lib/swish-e/spider.pl
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'spider_prism.config'

Summary for: http://prism.enes.org/Publications/Reports/Report05.pdf
         Connection: Close:      1  (1.0/sec)
               Total Bytes: 72,475  (72475.0/sec)
                Total Docs:      1  (1.0/sec)
               Unique URLs:      1  (1.0/sec)
application/pdf->text/html:      1  (1.0/sec)
Error: May not be a PDF file (continuing anyway)
Error (0): PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table
http://prism.enes.org/Publications/Reports/Report05.pdf - Using HTML2 parser -  (no words indexed)

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 8 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
8 unique words indexed.
5 properties sorted.                                             
1 file indexed.  72,475 total bytes.  8 total words.
Elapsed time: 00:00:02 CPU time: 00:00:00
Indexing done!
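
The "May not be a PDF file" message above is, as far as I can tell, what pdftotext prints when it cannot find a "%PDF-" header near the start of its input. A minimal sketch for checking that by hand, assuming the content handed to the filter has been saved to a local file. The helper name looks_like_pdf and the two demo file names are made up for illustration and are not part of swish-e or spider.pl:

```shell
# Hypothetical helper: succeed if the first 1024 bytes contain a PDF header.
looks_like_pdf() {
    head -c 1024 "$1" | LC_ALL=C grep -a -q '%PDF-'
}

# Demonstration with two throwaway files (names invented for this sketch).
tmpdir=$(mktemp -d)
printf '%%PDF-1.4\n' > "$tmpdir/real.pdf"
printf '<html><body>already converted</body></html>\n' > "$tmpdir/converted.pdf"

for f in "$tmpdir/real.pdf" "$tmpdir/converted.pdf"; do
    if looks_like_pdf "$f"; then
        echo "$(basename "$f"): PDF header found"
    else
        echo "$(basename "$f"): no PDF header"
    fi
done
```

If a saved copy of the fetched document fails this check, the data was probably no longer raw PDF by the time pdftotext saw it. The check says nothing about why.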

**My swish.conf contains...

prismweb@hermes> more swish.conf
# Administrative Directives
IndexName "PRISM Site Index"
IndexDescription "This is a swish index of the PRISM web site!"
IndexAdmin "R.S.Hatcher <r.s.hatcher@rdg.ac.uk>"

IndexFile /export/hermes/hermes-01/apache/htdocs/htdocs-prism/live/search/swish_files/prism.index

ReplaceRules replace /export/hermes/hermes-01/apache/htdocs/htdocs-prism/ http://prism.enes.org/

# Use spider.pl as the external program:
IndexDir /usr/local/lib/swish-e/spider.pl

# Now name the specific configuration file for spider.pl, which lists
# all those files you don't want spidered in prism
SwishProgParameters spider_prism.config

obeyRobotsNoIndex yes

IndexReport 3
# This is how detailed you want reporting. You can specify numbers
# 0 to 3 - 0 is totally silent, 3 is the most verbose.

IndexContents HTML* .htm .html .php 
IndexContents TXT* .txt
   
# Otherwise, use the HTML parser
DefaultContents HTML*

NoContents .gif .jpg .jpeg .ps 

FileFilter .pdf pdftotext "'%p' -"

IgnoreTotalWordCountWhenRanking yes

ConvertHTMLEntities yes

# Allow extra searching by title, path, description
Metanames swishtitle swishdocpath swishdescription

# Set StoreDescription for each parser
#  to display context with search results
StoreDescription TXT* 10000
StoreDescription HTML* <body> 10000

**and the spider_prism.config contains...

@servers = (
            {
#                base_url                => 'http://prism.enes.org/index.php',
                base_url                => 'http://prism.enes.org/Publications/Reports/Report05.pdf',
                same_hosts              => 'www.prism.enes.org',
                email                   => 'r.s.hatcher@rdg.ac.uk',
                use_default_config      => 1,
                use_md5                 => 1,   # If true, this will use the Digest::MD5
                                                # module to create checksums on content.
                                                # This will very likely catch files
                                                # with different URLs that are the same
                                                # content.  Will trap / and /index.html,
                                                # for example.
                delay_sec               => 0,   # Delay in seconds between requests
                remove_leading_dots     => 1,
                keep_alive              => 1,   # Try to keep the connection open
                    
                test_url => \&test_url,
            },
            );
1;

sub test_url {
    use URI::QueryParam;
    my $uri = shift;  
   
    # if sort_order is in the URL then don't return it
    my $id = $uri->query_param('sort_order');
    return 0 if $id && $id =~ /ASC|DESC/; 
                   
    return 0 if $uri->path =~ /_inc|Connections/;
    return 0 if $uri->path =~ /Images|css|Templates/;
    return 0 if $uri->path =~ /graph|admin|make_pdf/;
    return 0 if $uri->path =~ /Internal/;
    return 0 if $uri->path =~ /\.(xml|old|css)?$/;
    return 0 if $uri->path =~ /Documentation/;
   
    return 1 if $uri->path =~ /\.(html|htm|php|pdf)?$/;
}

1;
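
The test_url filter can be sanity-checked outside the spider. The following is a rough sketch only: it copies the path patterns from test_url and uses grep -E as a stand-in for Perl's regex matching, it skips the sort_order query check, and the function name path_allowed is invented for this example. It is not how spider.pl itself evaluates the callback.

```shell
# Hypothetical stand-in for test_url: succeed if a URL path would be accepted.
path_allowed() {
    p=$1
    # Reject patterns, copied from test_url in spider_prism.config.
    for re in '_inc|Connections' 'Images|css|Templates' \
              'graph|admin|make_pdf' 'Internal' \
              '\.(xml|old|css)?$' 'Documentation'; do
        if printf '%s\n' "$p" | grep -Eq -- "$re"; then
            return 1
        fi
    done
    # Accept pattern, also copied from test_url.
    printf '%s\n' "$p" | grep -Eq -- '\.(html|htm|php|pdf)?$'
}

# Try a few paths, including the PDF that is used as base_url.
for p in /Publications/Reports/Report05.pdf /index.php \
         /Internal/notes.html /css/site.css; do
    if path_allowed "$p"; then
        echo "accept $p"
    else
        echo "reject $p"
    fi
done
```

Under these assumptions the Report05.pdf path passes the filter, so test_url does not look like the reason the PDF fails to index.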

**Using http method:
prismweb@hermes> local/bin/swish-e -c swish.conf -S http -i http://www.prism.enes.org/Publications/Reports/Report05.pdf
Indexing Data Source: "HTTP-Crawler"
Indexing "http://www.prism.enes.org/Publications/Reports/Report05.pdf"
Now fetching [http://www.prism.enes.org/robots.txt]...Status: 404.
retrieving http://www.prism.enes.org/Publications/Reports/Report05.pdf (0)...
sleeping 5 seconds before fetching http://www.prism.enes.org/Publications/Reports/Report05.pdf
Now fetching [http://www.prism.enes.org/Publications/Reports/Report05.pdf]...Status: 200. application/pdf
 - Using HTML2 parser -  (11861 words)

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 1,337 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
1,337 unique words indexed.
5 properties sorted.                                             
1 file indexed.  433,766 total bytes.  11,870 total words.
Elapsed time: 00:00:08 CPU time: 00:00:00
Indexing done!

Sure enough, the index has been created OK:
prismweb@hermes> /usr/local/bin/swish-e -w SRE -f prism.index
# SWISH format: 2.4.3
# Search words: SRE
# Removed stopwords:
# Number of hits: 1
# Search time: 0.001 seconds
# Run time: 0.031 seconds
1000 http://www.prism.enes.org/Publications/Reports/Report05.pdf "Report05.pdf" 433766
.
prismweb@hermes>
I get similar results for:
prismweb@hermes> local/bin/swish-e -c swish.conf -S fs -i /home/prismweb/live/Publications/Reports/Report05.pdf

-- 
------------------------------------------------------------------------
Rosalyn Hatcher
CGAM, Dept. of Meteorology, University of Reading, 
Earley Gate, Reading. RG6 6BB
Email: r.s.hatcher@reading.ac.uk     Tel: +44 (0) 118 378 7841
Received on Fri Jan 6 03:49:00 2006