
Problem on Parser with TXT/HTML and Spider.pl

From: Robert Keith <Robert(at)not-real.Technolords.com>
Date: Tue Apr 29 2003 - 22:30:50 GMT
I am having a strange problem indexing a combination of MSWord, .txt, and
PHP documents using spider.pl and feeding the result into swish-e.  If I
index the PHP URLs first, those documents are parsed and loaded as HTML.
If I spider the MSWord and other documents first, they are run through the
spider.pl filter routines and parsed as TXT (correctly), but the PHP and
HTML documents that follow are then also parsed as TXT.  The
SwishSpiderConfig.pl file contains two entries: the URL with the MSWord
links and the URL with only PHP links.
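
The order dependence makes me suspect some per-server state persisting
between documents.  Here is a minimal illustration (hypothetical; not
spider.pl's actual code) of how a parser hint that is set but never
cleared would reproduce exactly this behavior:

    my %server;   # one hash reused for every URL on a host, like $server

    for my $doc ( [ 'example.doc', 'application/msword' ],
                  [ 'index.php',   'text/html' ] ) {
        my ( $name, $type ) = @$doc;
        # a parser hint gets set for non-text documents only...
        $server{parser_type} = 'TXT*' if $type !~ m!^text/!;
        # ...and nothing ever clears it, so index.php inherits TXT*
        print "$name -> ", ( $server{parser_type} || 'HTML (default)' ), "\n";
    }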

This is the command I use to index:

perl /fs/area/intellisearch/search/prog-bin/spider.pl prof1.pl | swish-e -S prog -c prof1 -i stdin -v3
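
To separate spider problems from indexer problems, the spider output can
also be captured to a file and fed to swish-e in a second step:

    perl /fs/area/intellisearch/search/prog-bin/spider.pl prof1.pl > spider.out
    swish-e -S prog -c prof1 -i stdin -v3 < spider.out

Each document in spider.out is preceded by headers such as Path-Name and
Content-Length, so it should show whether the spider or swish-e is the one
choosing the wrong parser.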

I have included two runs.  The only difference is that in the
SwishSpiderConfig.pl (prof1.pl) file the two SERVER entries are reversed.

If anyone has any hints on how I can troubleshoot this, it would be much
appreciated.

Robert Keith


-----------------------------------------------------
The stdout log from the good run looks like:

Parsing config file 'prof1'
Parsing config file '/fs/area/intellisearch/conf/common.config'
Indexing Data Source: "External-Program"
Indexing "stdin"
/fs/area/intellisearch/search/prog-bin/spider.pl: Reading parameters from 'prof1.pl'
http://www.intellivence.com/index.php/ - Using HTML parser -  (221 words)
http://www.intellivence.com/index.php - Using HTML parser -  (221 words)
http://www.intellivence.com/index.php/index.php - Using HTML parser -  (221 words)

<<<Snipped here>>> (more of the same...)

Need Authentication for http://www.intellivence.com/docs/QSG.doc at realm 'Intellivence User'
(<Enter> skips)
Username:
Skipping http://www.intellivence.com/docs/QSG.doc

Summary for: http://www.intellivence.com/index.php/
    Duplicates:     507  (42.2/sec)
Off-site links:       1  (0.1/sec)
       Skipped:       1  (0.1/sec)
   Total Bytes: 339,153  (28262.8/sec)
    Total Docs:      38  (3.2/sec)
   Unique URLs:      40  (3.3/sec)
http://www.intellivence.com/news.php - Using HTML parser -  (220 words)
http://www.intellivence.com/downloads/ - Using HTML parser -  (32 words)
http://www.intellivence.com/downloads/example.doc - Using TXT parser -  (13 words)
http://www.intellivence.com/downloads/example.xls - Using HTML parser -  (21 words)

Summary for: http://theweb:access@www.intellivence.com/downloads/
    Skipped:       3  (0.2/sec)
Total Bytes: 166,450  (10403.1/sec)
 Total Docs:       6  (0.4/sec)
Unique URLs:       7  (0.4/sec)
http://www.intellivence.com/downloads/bigdoc.doc - Using TXT parser -  (34612 words)
http://www.intellivence.com/downloads/example.txt - Using TXT parser -  (17 words)
http://www.intellivence.com/downloads/good.html - Using TXT parser -  (67 words)

Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 10109 words alphabetically
Writing header ...
Writing index entries ...
  Writing word text: Complete
  Writing word hash: Complete
  Writing word data: Complete
10109 unique words indexed.
5 properties sorted.
44 files indexed.  505603 total bytes.  42626 total words.
Elapsed time: 00:00:29 CPU time: 00:00:00
Indexing done!

-------------------------------------------------

The bad run: the output when the two SwishSpiderConfig.pl entries are
reversed:

Parsing config file 'prof1'
Parsing config file '/fs/area/intellisearch/conf/common.config'
Indexing Data Source: "External-Program"
Indexing "stdin"
/fs/area/intellisearch/search/prog-bin/spider.pl: Reading parameters from 'prof1.pl'
http://www.intellivence.com/downloads/ - Using HTML parser -  (32 words)
http://www.intellivence.com/downloads/example.doc - Using TXT parser -  (13 words)
http://www.intellivence.com/downloads/example.xls - Using HTML parser -  (21 words)

Summary for: http://theweb:access@www.intellivence.com/downloads/
    Skipped:       3  (0.2/sec)
Total Bytes: 166,450  (10403.1/sec)
 Total Docs:       6  (0.4/sec)
Unique URLs:       7  (0.4/sec)
http://www.intellivence.com/downloads/bigdoc.doc - Using TXT parser -  (34612 words)
http://www.intellivence.com/downloads/example.txt - Using TXT parser -  (17 words)
http://www.intellivence.com/downloads/good.html - Using TXT parser -  (67 words)
http://www.intellivence.com/index.php/ - Using TXT parser -  (1368 words)
http://www.intellivence.com/index.php - Using TXT parser -  (1368 words)

<<<Snipped again>>>

*** Notice that the HTML and PHP documents are now parsed as TXT.  In both
runs this starts only after a .doc file has been filtered earlier in the run.


=====================================================
The prof1 swish config file contains:

IncludeConfigFile /fs/area/search/conf/common.config
IndexFile /fs/area/search/indexfiles/prof1
SwishProgParameters /fs/area/search/conf/prof1.pl


=====================================================

The prof1.pl spider.pl config file contains:

use lib '/fs/area/intellisearch/search/prog-bin';

@servers = (

    {
#       skip        => 1,         # Flag to disable spidering this host.
        debug       => DEBUG_URL | DEBUG_SKIPPED | DEBUG_HEADERS | DEBUG_INFO | DEBUG_LINKS,

        base_url => 'http://www.intellivence.com/index.php/',

        agent       => 'swish-e spider http://swish-e.org/',
        email       => 'swish@domain.invalid',
        delay_min   => .0001,     # Delay in minutes between requests
        max_time    => 10,        # Max time to spider in minutes
        max_files   => 60,        # Max files to spider
        ignore_robots_file => 0,  # Don't set that to one, unless you are sure.

        use_cookies => 0,         # True will keep cookie jar
        use_md5     => 0,         # If true, this will use the Digest::MD5 module to skip duplicate content

        test_url        => \&test_url,
        test_response   => \&test_response,
        filter_content  => \&filter_content,
    },


    {
#       skip        => 1,         # Flag to disable spidering this host.
        debug       => DEBUG_URL | DEBUG_SKIPPED | DEBUG_HEADERS | DEBUG_INFO | DEBUG_LINKS,

        base_url => 'http://theweb:access@www.intellivence.com/downloads/',

        agent       => 'swish-e spider http://swish-e.org/',
        email       => 'swish@domain.invalid',
        delay_min   => .0001,     # Delay in minutes between requests
        max_time    => 10,        # Max time to spider in minutes
        max_files   => 60,        # Max files to spider
        ignore_robots_file => 0,  # Don't set that to one, unless you are sure.

        use_cookies => 0,         # True will keep cookie jar
        use_md5     => 0,         # If true, this will use the Digest::MD5 module to skip duplicate content

        test_url        => \&test_url,
        test_response   => \&test_response,
        filter_content  => \&filter_content,
    },


);


#---------------------- Public Functions ------------------------------
#
#----------------------------------------------------------------------


# This subroutine lets you check a URL before requesting the
# document from the server
# return false to skip the link

sub test_url {
    my ( $uri, $server ) = @_;
    # return 1;  # Ok to index/spider
    # return 0;  # No, don't index or spider;

    # ignore any common image files
#    return $uri->path !~ /\.(gif|jpg|jpeg|png)$/;

    # index only paths with a known document extension, plus directory URLs
    return $uri->path =~ /\.(doc|html|htm|php|ppt|txt|xls|jsp)$/
        || $uri->path =~ m{/$};

}
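
A quick way to sanity-check which paths the pattern keeps (the sample
paths here are just examples):

    perl -e 'for (qw( /index.php /downloads/ /docs/QSG.doc /logo.gif )) {
        print "$_ => ", ( /\.(doc|html|htm|php|ppt|txt|xls|jsp)$/ || m{/$} )
            ? "keep" : "skip", "\n" }'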

# This routine is called when the *first* block of data comes back
# from the server.  If you return false, no more content will be read
# from the server.  $response is an HTTP::Response object.


sub test_response {
    my ( $uri, $server, $response ) = @_;

#    $server->{no_contents}++ unless $response->content_type =~ m[^text/html];
    return 1;  # ok to index and spider
}

# This is an example of how to use the SWISH::Filter module included
# with the swish-e distribution.  Make sure that SWISH::Filter is
# in the @INC path (e.g. set PERL5LIB before running swish).
#
# Returns:
#      true if content-type is text/* or if the document was filtered
#      false if document was not filtered
#      aborts if module or filter object cannot be created.
#

my $filter;  # cache the object.

sub filter_content {
    my ( $uri, $server, $response, $content_ref ) = @_;

    my $content_type = $response->content_type;

    # Ignore text/* content type -- no need to filter
    return 1 if !$content_type || $content_type =~ m!^text/!;

    # Load the module - returns FALSE if cannot load module.
    unless ( $filter ) {
        eval { require SWISH::Filter };
        if ( $@ ) {
            $server->{abort} = $@;
            return;
        }
        $filter = SWISH::Filter->new;
        unless ( $filter ) {
            $server->{abort} = "Failed to create filter object";
            return;
        }
    }

    # If not filtered return false and doc will be ignored (not indexed)

    return unless $filter->filter(
        document => $content_ref,
        name     => $response->base,
        content_type => $content_type,
    );

    # copy the filtered document back into the caller's buffer
    $$content_ref = ${$filter->fetch_doc};

    # let's see if we can set the parser.
    $server->{parser_type} = $filter->swish_parser_type || '';

    return 1;
}
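
If the parser choice really is sticky between documents, one experiment
(untested, and assuming spider.pl re-reads $server->{parser_type} for each
document) would be to clear the hint instead of returning early for text
content, i.e. replace the early return at the top of filter_content with:

    # Ignore text/* content type -- no need to filter, but clear any
    # parser hint left over from a previously filtered document
    if ( !$content_type || $content_type =~ m!^text/! ) {
        $server->{parser_type} = '';
        return 1;
    }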

# Must return true...

1;