Re: win2k unknown header problem

From: Matt Kynaston <Matt.Kynaston(at)not-real.etbroker.com>
Date: Thu Sep 26 2002 - 16:48:21 GMT
> Well, I can't reproduce the problem on Windows 2000, PERL 5.6.1
> (ActiveState build 631), SWISH-E 2.2.  For reference, I tested clean
> installs of 2.2rc1, 2.2, 2.2a, and 2.2 installed over top of 2.2rc1.
>
> A minimal config file and URL that reproduces the problem might be
> helpful, as usual.

OK. URL is not live (got the unlucky task of moving a PHP/Linux site to
PHP/Windows - don't ask - and it's still on the dev server). Configs are
below.

I've narrowed the problem down to the "No-Content: 1" header - if I remove:
	<meta name="robots" content="nocontents">
from the first file being spidered (listed below), swish-e is happy.

I've got a single page (swish.php - I've replaced it with hello world for
testing), pulled from the database, that includes links to everything I want
indexed (to a depth of 1). I've done this to avoid unneeded load on the site,
and would rather not have this page showing in the results. Is there an easy
workaround, or should I stick with rc1?

Thanks for the help,

Matt

########################################################
# swish-e configuration options common to all searches

# Specify the spider program to run
IndexDir prog-bin/spider.pl

# Store the description meta tag in the index
StoreDescription HTML <meta_description> 200

# exclude words of less than 3 characters
# MinWordLimit 3

# meta tags to recognise
MetaNames keywords description

# other meta tags... index 'em!
UndefinedMetaTags index

# include the description meta tag
PropertyNames description

# use ascii7 characters to remove confusion over accents, etc.
TranslateCharacters :ascii7:
########################################################

Then site-specific config:

########################################################
# pretty text
IndexName "ecostas.com english swish-e index"
IndexDescription "index of ecostas.com site, including descriptions"
IndexPointer http://www.ecostas.com/swish.php
IndexAdmin "Matt Kynaston <matt.kynaston@etbroker.com>"

# include common config
IncludeConfigFile "c:/Program Files/SWISH-E/conf/common.config"

# index file to create...
IndexFile e:/Inetpub/wwwroot/ecostas.com/common/search/ecostas.com.index_en

# config file for spider to use...
SwishProgParameters "e:/Inetpub/wwwroot/ecostas.com/common/search/ecostas.com_en.spider.config.pl"
########################################################

Spider's config file is:

########################################################
=pod

=head1 NAME

ecostas.com_en.spider.config.pl - swish-e spider configuration for english
ecostas.com

=head1 DESCRIPTION

Sets up the swish spider to start at base_url and only spider to a depth of
one - i.e. base_url MUST contain a link to every file that is to be included
in the search. Only spiders files with the extension .htm[l] or .php.


Please see C<perldoc spider.pl> for more information.

=cut

#--------------------- Global Config ----------------------------

#  @servers is a list of hashes -- so you can spider more than one site
#  in one run (or different parts of the same tree)
#  The main program expects to use this array (@SwishSpiderConfig::servers).

  ### Please do not spider these examples -- spider your own servers, with permission ####

@servers = (


#=============================================================================
    # standard options - limits depth to 1: the sitemap page must link to every
    # page on the site that we are interested in!
    {
        skip        => 0,  # skip spidering this server

        base_url    => 'http://dev.ecostas.com/swish.php',
        same_hosts  => [ qw/ecostas.com www.ecostas.com/ ],
        agent       => 'swish-e spider',
        email       => 'matt@etbroker.com',
        link_tags   => [qw/ a frame /],
        max_depth   => 1,

        # limit to only .htm, .html and .php files
        test_url    => sub { $_[0]->path =~ /\.(?:html?|php)$/ }
    }
);


# Must return true...

1;
########################################################
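
The test_url callback in the spider config is meant to let through only paths
ending in .htm, .html or .php. A minimal standalone sketch of that kind of
extension check, assuming a plain path string instead of the URI object that
spider.pl passes to the callback. The helper name wanted_path and the sample
paths are made up for illustration and are not part of spider.pl:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of spider.pl): returns 1 when the path
# ends in .htm, .html or .php, and 0 otherwise.
sub wanted_path {
    my ($path) = @_;
    return $path =~ /\.(?:html?|php)$/ ? 1 : 0;
}

# Sample paths are invented for illustration only.
for my $path (qw( /swish.php /index.html /about.htm /logo.gif /docs/ /odd. )) {
    printf "%-12s %s\n", $path, wanted_path($path) ? 'spider' : 'skip';
}
```

Because the check runs against the path only, a query string on the URL
should not affect which pages are accepted.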

The page being spidered (swish.php - cut down to nothing for testing...) is:

########################################################
<html>
  <head>
    <title>Untitled</title>
    <meta http-equiv="Content-Type" content="text/html;">
    <!-- don't let swish index contents of this page, just follow links -->
    <meta name="robots" content="nocontents">
    <link rel="stylesheet" href="stylesheet.css" type="text/css">
  </head>

  <body bgcolor="#ffffff" leftmargin="0" topmargin="0" marginwidth="0">
  <p>Hello world</p>
  </body>
</html>
########################################################




Received on Thu Sep 26 16:53:28 2002