[swish-e] Indexing problem

From: Lyle Jensen <lyle.jensen(at)>
Date: Fri Nov 05 2010 - 18:50:14 GMT
I'm having trouble getting SWISH-e to work with IIS unless Directory
Browsing is turned on, and I don't want to do that.

SWISH-e runs on the same server as IIS.  The desired content is in a virtual
folder, /docs.  Underneath that folder are several additional folders at the
next level containing pdf & doc files.  If I run SWISH-e with Directory
Browsing turned on for the /docs virtual folder, everything indexes as
expected.  However, we don't want to allow wide-open access to browsing those
directories.  If I add a default document, nobrowse.php, to /docs and to each
folder below it, indexing fails: it gets to the nobrowse.php document and
stops.  And of course, if I turn Directory Browsing off, indexing fails.

How can I get SWISH-e to index the files in /docs?


SWISH-e v2.43
Windows Server 2003 Enterprise x64, IIS 6.0

command line: c:\SWISH-e\swish-e.exe  -e -v 3 -c C:\swish-e\swish.conf -S prog 1>C:\swish-e\swish_stdout.txt 2>C:\swish-e\swish_stderr.txt

IndexDir /perl/bin/perl.exe
IndexFile /inetpub/wwwroot/swish/index.swish-e
IndexOnly .html .htm .pdf .doc
@servers = (
  {
    use_default_config  => 1,
    email               => '',
    base_url            => '',
    test_url            => \&test_url,
    test_response       => \&response_sub,
    # delay_sec should be commented out in production
    delay_sec           => 0,
    max_time            => 90,    # Max time to spider in minutes - changed 19Oct10 lj
    max_wait_time       => 180,   # Max time in seconds for spider to wait for data to be returned - added 19Oct10 lj
    max_size            => 0,     # Override max size of 5mb
    keep_alive          => 1,
    # This is OK if we are indexing our own site
    ignore_robots_file  => 1,
    # Use this one in production
    debug               => 'skipped,errors',
  },
);

sub test_url {
  my ( $uri, $server ) = @_;
  # return 1;  # Ok to index/spider
  # return 0;  # No, don't index or spider

  # make sure that the path is limited to the swish path
  #print STDERR "Checking $uri->path\r\n";
  return 0 if $uri->path !~ m[^/docs]i;
  #return 0 if $uri->path =~ m[^/docs/save]i;

  # ignore any of these file types
  if ( $uri->path =~ /\.(css|gif|jpeg|jpg|png|asp|php|ppt|pptx|mp4|wmv|asx|msi|arf)?$/i ) {
    #print STDERR "Skipping $uri->path, this file type is excluded\r\n";
    return 0;
  }
  return 1;
}

# This is used in HEAD request to test the content type ahead of time
sub response_sub {
  my ( $uri, $server, $response, $content_ref ) = @_;
  my $content_type = $response->content_type;
  return 1 if $content_type =~ m!^text/!;                # allow all text (assume we don't want to filter)
  return 1 if $content_type =~ m[^application/msword]i;  # allow word doc files
  return 1 if $content_type =~ m[^application/pdf]i;     # allow pdf
  return 0;
}
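
As a rough illustration of how the test_url filter above treats different paths, here is a minimal standalone sketch. It assumes plain path strings rather than the URI objects the spider actually passes in. The name path_allowed and the sample paths (other than /docs and nobrowse.php) are hypothetical and used only for this example. The two regular expressions are copied from the config. It may help to see that the filter accepts /docs/ itself but rejects anything ending in .php, nobrowse.php included.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for test_url: takes a plain path string
# instead of a URI object, but applies the same two checks.
sub path_allowed {
    my ($path) = @_;

    # Only paths under /docs are considered (case-insensitive).
    return 0 if $path !~ m[^/docs]i;

    # Same exclusion pattern as in the config above.
    return 0 if $path =~ /\.(css|gif|jpeg|jpg|png|asp|php|ppt|pptx|mp4|wmv|asx|msi|arf)?$/i;

    return 1;
}

# Sample paths (illustrative only, not taken from the real site).
for my $path ( '/docs/', '/docs/nobrowse.php', '/docs/manuals/guide.pdf', '/other/file.doc' ) {
    printf "%-28s %s\n", $path, path_allowed($path) ? 'spider' : 'skip';
}
```

Run with perl on the command line, this prints one line per sample path, so the effect of each rule can be checked without starting a full spider run.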


<title>404 - NOT FOUND</title>
<?php echo '404 - NOT FOUND'; ?>

Sent by Lyle Jensen

Received on Fri Nov 5 14:50:17 2010