[swish-e] Indexing problem

From: Lyle Jensen <lyle.jensen(at)not-real.gmail.com>
Date: Fri Nov 05 2010 - 18:50:14 GMT
I'm having trouble getting SWISH-e to work with IIS unless Directory
Browsing is turned on, and I don't want to do that.

SWISH-e runs on the same server as IIS.  The desired content is in a virtual
folder, /docs.  Underneath that folder are several additional folders at the
next level containing pdf & doc files.  If I run SWISH-e with Directory
Browsing turned on for the /docs virtual folder, everything indexes as
expected.  However, we don't want to allow wide-open access to browsing those
directories.  If I add a default document, nobrowse.php, to /docs and to each
folder below it, indexing fails: it gets to the nobrowse.php document and
stops.  And of course, if I turn Directory Browsing off, indexing fails.

How can I get SWISH-e to index the files in /docs?

Thanks!



Details:
SWISH-e v2.43
Windows Server 2003 Enterprise x64, IIS 6.0

command line: c:\SWISH-e\swish-e.exe -e -v 3 -c C:\swish-e\swish.conf -S prog 1>C:\swish-e\swish_stdout.txt 2>C:\swish-e\swish_stderr.txt

swish.conf:
IndexDir /perl/bin/perl.exe
IndexFile /inetpub/wwwroot/swish/index.swish-e
SwishProgParameters c:/swish-e/lib/swish-e/spider.pl c:/swish-e/lib/swish-e/SwishSpiderConfig.pl
IndexOnly .html .htm .pdf .doc

SwishSpiderConfig.pl:
@servers = (
  {
    use_default_config  => 1,
    email               => 'me@mysite.com',
    base_url            => 'https://www.mysite.com/docs',
    test_url            => \&test_url,
    test_response       => \&response_sub,
    # delay_sec should be commented out in production
    delay_sec           => 0,
    max_time            => 90,    # Max time to spider in minutes - changed 19Oct10 lj
    max_wait_time       => 180,   # Max time in seconds for spider to wait for data to be returned - added 19Oct10 lj
    max_size            => 0,     # Override max size of 5mb
    keep_alive          => 1,
    # This is OK if we are indexing our own site
    ignore_robots_file  => 1,
    # Use this one in production
    debug               => 'skipped,errors',
  },
);

sub test_url {
  my ( $uri, $server ) = @_;
  # return 1;  # Ok to index/spider
  # return 0;  # No, don't index or spider

  # make sure that the path is limited to the swish path
  #print STDERR "Checking $uri->path\r\n";
  return 0 if $uri->path !~ m[^/docs]i;
  #return 0 if $uri->path =~ m[^/docs/save]i;

  # ignore any of these file types
  if ($uri->path =~ /\.(css|gif|jpeg|jpg|png|asp|php|ppt|pptx|mp4|wmv|asx|msi|arf)?$/i ) {
    #print STDERR "Skipping $uri->path, this file type is excluded\r\n";
    return 0;
  }
  return 1;
}

# This is used in HEAD request to test the content type ahead of time
sub response_sub {
  my ( $uri, $server, $response, $content_ref ) = @_;
  my $content_type = $response->content_type;
  return 1 if $content_type =~ m!^text/!;                # allow all text (assume we don't want to filter)
  return 1 if $content_type =~ m[^application/msword]i;  # allow word doc files
  return 1 if $content_type =~ m[^application/pdf]i;     # allow pdf files
  return 0;
}

1;
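
For reference, here is a minimal standalone sketch of how the two path checks
in test_url classify a few URLs. It copies the patterns from the config above
unchanged. The helper name path_allowed and the sample paths are made up for
illustration and are not part of spider.pl or of the real site.

```perl
use strict;
use warnings;

# Extension pattern copied verbatim from test_url in SwishSpiderConfig.pl.
my $excluded = qr/\.(css|gif|jpeg|jpg|png|asp|php|ppt|pptx|mp4|wmv|asx|msi|arf)?$/i;

# Hypothetical helper: returns 1 if the path passes both checks in
# test_url, 0 if either check rejects it.
sub path_allowed {
    my ($path) = @_;
    return 0 if $path !~ m[^/docs]i;    # outside the /docs tree
    return 0 if $path =~ $excluded;     # excluded file type
    return 1;
}

# Hypothetical sample paths, chosen only to exercise each branch.
my @samples = (
    '/docs',
    '/docs/',
    '/docs/manuals/guide.pdf',
    '/docs/nobrowse.php',
    '/other/page.html',
);

for my $path (@samples) {
    printf "%-26s %s\n", $path, path_allowed($path) ? 'allowed' : 'skipped';
}
```

As far as these patterns go, a directory URL such as /docs/ is allowed because
its path does not end in an excluded extension, while a direct request for
/docs/nobrowse.php is skipped. Which document the server actually returns for
the directory URL is a separate matter that this sketch does not cover.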

nobrowse.php:
<html>
<head>
<title>404 - NOT FOUND</title>
</head>
<body>
<?php echo '404 - NOT FOUND'; ?>
</body>
</html>



-- 
Sent by Lyle Jensen


Received on Fri Nov 5 14:50:17 2010