[swish-e] (no subject)

From: Ullas <ullas(at)not-real.burgundysky.com>
Date: Tue Jun 09 2009 - 04:16:23 GMT
Hi all,

I have a page that swish-e fails to index.

The output is:

Warning: External program returned zero Content-Length when processing file'http://www.admiralmotorinn.com.au/index.php?pageid=3746'
http://www.admiralmotorinn.com.au/index.php?pageid=3746 - Using DEFAULT (HTML2) parser -  (no words indexed)
err: External program failed to return required headers Path-Name:
.

version: SWISH-E 2.4.7
uname -a: Linux ganymede 2.4.26 #6 Mon Jun 14 19:07:27 PDT 2004 i686 unknown unknown GNU/Linux

command line: /usr/local/bin/swish-e -S prog -c /var/www/indexes/config/swish.www.admiralmotorinn.com.au.conf -v3

swish.www.admiralmotorinn.com.au.conf contents:
###########################################
# Use the 'spider.pl' program included with Swish-e
IndexDir spider.pl
# Define what site to index
SwishProgParameters /var/www/indexes/config/spider.www.admiralmotorinn.com.au.config

# Allow extra searching by title, path
Metanames swishtitle swishdocpath

# StoreDescription HTML* <body> 200000

IndexFile /var/www/indexes/index.www.admiralmotorinn.com.au.swish-e

MetaNames description keywords
PropertyNames description keywords

IgnoreWords File: /var/www/indexes/stopwords
############################################


spider.www.admiralmotorinn.com.au.config contents:

##############################################
@servers = ({
        base_url    => 'http://www.admiralmotorinn.com.au/',
        agent       => 'swish-e spider http://swish-e.org/',
        email       => 'sysadmin@winradius.com',

        # This will generate A LOT of debugging information to STDOUT
        # debug       => DEBUG_URL | DEBUG_SKIPPED | DEBUG_HEADERS,

        delay_sec   => 0,         # Delay in seconds between requests
        keep_alive  => 1,

        # Here are hooks to callback routines to validate urls and responses
        # Probably a good idea to use them so you don't try to index
        # Binary data.  Look at content-type headers!

        test_url        => \&test_url,
} );

sub test_url {
        my ( $uri, $server ) = @_;
        # return 1;  # Ok to index/spider
        # return 0;  # No, don't index or spider;

        # ignore any common image files
        return 0 if $uri->path =~ /\.(gif|jpg|jpeg|png)$/i;

        # make sure that the path is limited to the docs path
        # return $uri->path =~ m[^/current/docs/];
        return 0 if $uri->path =~ m[/controlpanel/];
        return 0 if $uri->path =~ m[/scripts/];
        return 1;
}
1;
##############################################

Any help would be greatly appreciated.

The same config files work ok for about 100 other sites.
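
For anyone trying to reproduce this, here is a minimal sketch (not part of swish-e) for checking which documents the spider emits with an empty body. It assumes the spider's output has been captured to a file, and that each document starts with a "Path-Name:" header followed by a "Content-Length:" header, as the -S prog input format expects. The function name find_empty_docs and the file name spider.out are hypothetical.

```shell
#!/bin/sh
# find_empty_docs FILE
# Print the Path-Name of every document in FILE whose Content-Length is 0.
# Assumption: FILE holds captured "-S prog" style output, with Path-Name
# appearing before Content-Length in each header block. Body lines that
# happen to look like headers would confuse this; it is only a rough check.
find_empty_docs() {
    awk '
        # Remember the most recent Path-Name value.
        /^Path-Name:/ { path = $0; sub(/^Path-Name:[ \t]*/, "", path); next }
        # A zero Content-Length after a Path-Name marks an empty document.
        /^Content-Length:[ \t]*0[ \t\r]*$/ && path != "" { print path }
        # Any Content-Length line ends the header pair we are tracking.
        /^Content-Length:/ { path = "" }
    ' "$1"
}

# Possible usage, assuming spider.pl is on the PATH:
#   spider.pl /var/www/indexes/config/spider.www.admiralmotorinn.com.au.config > spider.out
#   find_empty_docs spider.out
```

If the failing URL shows up in that list, the empty response is coming from the web server (or the spider's fetch), not from the indexer itself.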

Ullas


Received on Tue Jun 9 00:16:21 2009