> Well, I can't reproduce the problem on Windows 2000, PERL 5.6.1
> (ActiveState build 631), SWISH-E 2.2. For reference, I tested clean
> installs of 2.2rc1, 2.2, 2.2a, and 2.2 installed over top of 2.2rc1.
>
> A minimal config file and URL that reproduces the problem might be
> helpful, as usual.
OK. URL is not live (got the unlucky task of moving a PHP/Linux site to
PHP/Windows - don't ask - and it's still on the dev server). Configs are
below.
I've narrowed the problem down to the "No-Content: 1" header - if I remove:
<meta name="robots" content="nocontents">
from the first file being spidered (listed below), swish-e is happy.
I've got a single page (swish.php - I've replaced it with hello world for
testing), pulled from the database, that links to everything I want
indexed (to a depth of 1). I've done this to avoid unneeded load on the site,
and would rather not have this page show up in the results. Is there an easy
workaround, or should I stick with rc1?
Thanks for the help,
Matt
########################################################
# swish-e configuration options common to all searches
# Specify the spider program to run
IndexDir prog-bin/spider.pl
# Store the description meta tag in the index
StoreDescription HTML <meta_description> 200
# exclude words of less than 3 characters
# MinWordLimit 3
# meta tags to recognise
MetaNames keywords description
# other meta tags... index 'em!
UndefinedMetaTags index
# include the description meta tag
PropertyNames description
# use ascii7 characters to remove confusion over accents, etc.
TranslateCharacters :ascii7:
########################################################
Then site-specific config:
########################################################
# pretty text
IndexName "ecostas.com english swish-e index"
IndexDescription "index of ecostas.com site, including descriptions"
IndexPointer http://www.ecostas.com/swish.php
IndexAdmin "Matt Kynaston <matt.kynaston@etbroker.com>"
# include common config
IncludeConfigFile "c:/Program Files/SWISH-E/conf/common.config"
# index file to create...
IndexFile e:/Inetpub/wwwroot/ecostas.com/common/search/ecostas.com.index_en
# config file for spider to use...
SwishProgParameters "e:/Inetpub/wwwroot/ecostas.com/common/search/ecostas.com_en.spider.config.pl"
########################################################
Spider's config file is:
########################################################
=pod
=head1 NAME
ecostas.com_en.spider.config.pl - swish-e spider configuration for english
ecostas.com
=head1 DESCRIPTION
Sets up the swish spider to start at base_url and only spider to a depth of
one - ie. base_url MUST contain a link to every file that is to be included
in the search. Only spiders files with extension .htm[l] or .php
Please see C<perldoc spider.pl> for more information.
=cut
#--------------------- Global Config ----------------------------
# @servers is a list of hashes -- so you can spider more than one site
# in one run (or different parts of the same tree)
# The main program expects to use this array (@SwishSpiderConfig::servers).
### Please do not spider these examples -- spider your own servers, with
### permission ####
@servers = (
#=============================================================================
# standard options - limits depth to 1: the sitemap page must link to every
# page on the site that we are interested in!
{
skip => 0, # skip spidering this server
base_url => 'http://dev.ecostas.com/swish.php',
same_hosts => [ qw/ecostas.com www.ecostas.com/ ],
agent => 'swish-e spider',
email => 'matt@etbroker.com',
link_tags => [qw/ a frame /],
max_depth => 1,
# limit to only .html, .htm and .php files
test_url => sub { $_[0]->path =~ /\.(html|htm|php)$/ }
}
);
# Must return true...
1;
########################################################
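One untested idea for keeping swish.php out of the results without the robots meta tag. This is a minimal sketch only: it assumes the installed spider.pl provides the test_response callback and honours a no_index flag on the server hash, as described in perldoc spider.pl. Neither has been checked against the 2.2 release, and the '/swish.php' path is simply taken from base_url above.

```perl
# Hypothetical extra entry for the server hash in @servers (not verified
# against swish-e 2.2).
test_response => sub {
    my ( $uri, $server, $response ) = @_;

    # Assumption: setting no_index tells spider.pl to skip indexing this
    # document while still following the links it contains.
    $server->{no_index}++ if $uri->path eq '/swish.php';

    return 1;    # true = keep processing this response
},
```

The entry would sit alongside test_url inside the same server hash.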
Page being spidered (swish.php - cut down to nothing for testing) is:
########################################################
<html>
<head>
<title>Untitled</title>
<meta http-equiv="Content-Type" content="text/html;">
<!-- don't let swish index contents of this page, just follow links -->
<meta name="robots" content="nocontents">
<link rel="stylesheet" href="stylesheet.css" type="text/css">
</head>
<body bgcolor="#ffffff" leftmargin="0" topmargin="0" marginwidth="0">
<p>Hello world</p>
</body>
</html>
########################################################
Received on Thu Sep 26 16:53:28 2002