Hi all,
I have a page that swish-e fails to index.
The output is:
Warning: External program returned zero Content-Length when processing
file'http://www.admiralmotorinn.com.au/index.php?pageid=3746'
http://www.admiralmotorinn.com.au/index.php?pageid=3746 - Using DEFAULT
(HTML2) parser - (no words indexed)
err: External program failed to return required headers Path-Name:
.
version: SWISH-E 2.4.7
uname -a: Linux ganymede 2.4.26 #6 Mon Jun 14 19:07:27 PDT 2004 i686
unknown unknown GNU/Linux
command line: /usr/local/bin/swish-e -S prog -c /var/www/indexes/config/swish.www.admiralmotorinn.com.au.conf -v3
swish.www.admiralmotorinn.com.au.conf contents:
###########################################
# Use the 'spider.pl' program included with Swish-e
IndexDir spider.pl
# Define what site to index
SwishProgParameters /var/www/indexes/config/spider.www.admiralmotorinn.com.au.config
# Allow extra searching by title, path
Metanames swishtitle swishdocpath
# StoreDescription HTML* <body> 200000
IndexFile /var/www/indexes/index.www.admiralmotorinn.com.au.swish-e
MetaNames description keywords
PropertyNames description keywords
IgnoreWords File: /var/www/indexes/stopwords
############################################
spider.www.admiralmotorinn.com.au.config contents:
##############################################
@servers = ({
    base_url => 'http://www.admiralmotorinn.com.au/',
    agent => 'swish-e spider http://swish-e.org/',
    email => 'sysadmin@winradius.com',
    # This will generate A LOT of debugging information to STDOUT
    # debug => DEBUG_URL | DEBUG_SKIPPED | DEBUG_HEADERS,
    delay_sec => 0, # Delay in seconds between requests
    keep_alive => 1,
    # Here are hooks to callback routines to validate urls and responses
    # Probably a good idea to use them so you don't try to index
    # Binary data. Look at content-type headers!
    test_url => \&test_url,
} );
sub test_url {
    my ( $uri, $server ) = @_;
    # return 1; # Ok to index/spider
    # return 0; # No, don't index or spider;
    # ignore any common image files
    return 0 if $uri->path =~ /\.(gif|jpg|jpeg|png)?$/;
    # make sure that the path is limited to the docs path
    # return $uri->path =~ m[^/current/docs/];
    return 0 if $uri->path =~ m[/controlpanel/];
    return 0 if $uri->path =~ m[/scripts/];
    return 1;
}
1;
##############################################
Any help would be greatly appreciated.
The same config files work OK for about 100 other sites.
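For context on the messages in the output above: in -S prog mode, swish-e reads each document from the external program (here spider.pl) as a short header block, Path-Name: and Content-Length: followed by a blank line, and then that many bytes of content. The warning indicates a record whose Content-Length was 0. Below is a minimal Python sketch of a reader for that stream, meant only for inspecting a saved copy of the spider's output. The function name read_prog_records and the sample data are hypothetical and not part of swish-e; any optional headers are collected and ignored.

```python
import io


def read_prog_records(stream):
    """Yield (path, length, body) for each document in a binary prog-mode stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream between records
        headers = {}
        # Header lines run until the first blank line.
        while line.strip():
            name, _, value = line.decode("latin-1").partition(":")
            headers[name.strip().lower()] = value.strip()
            line = stream.readline()
        if "path-name" not in headers or "content-length" not in headers:
            raise ValueError("record is missing Path-Name or Content-Length")
        length = int(headers["content-length"])
        # The body is exactly `length` bytes; the next header follows directly.
        yield headers["path-name"], length, stream.read(length)


if __name__ == "__main__":
    # Hypothetical two-record sample; the second record has an empty body.
    sample = (
        b"Path-Name: http://example.invalid/a\nContent-Length: 5\n\nhello"
        b"Path-Name: http://example.invalid/b\nContent-Length: 0\n\n"
    )
    for path, length, _body in read_prog_records(io.BytesIO(sample)):
        if length == 0:
            print("zero-length document:", path)
```

Running spider.pl by hand with the same spider config file, redirecting its standard output to a file, and passing that file (opened in binary mode) to this reader should show whether the pageid=3746 URL really comes back with an empty body.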
Ullas
Received on Tue Jun 9 00:16:21 2009