FW: Re: Filtering problems

From: Klingensmith, Rick <klingensmith(at)>
Date: Fri Sep 19 2003 - 19:05:33 GMT
OK, I've seen the light and am switching over to use So far I've
gotten it to use and point to my local host to find 4
URLs (which is correct). However, it is not indexing the output, and the
problem is probably with the filter object. Here is the output from swish-e
using the following command line:

C:\SWISH-E>C:\Swish-E\swish-e -S prog -c C:\Swish-E\conf\siteindexpl.config

Warning: Configuration setting for TmpDir 'C:/Inetpub/Indexes/Temp' will be
overridden by environment setting 'C:\DOCUME~1\klingen2\LOCALS~1\Temp'
Indexing Data Source: "External-Program"
Indexing "prog-bin/"
External Program found: ./prog-bin/
C:\SWISH-E\prog-bin\ Reading parameters from

 -- Starting to spider: http://localhost/ --
?Testing 'test_url' user supplied function #1 'http://localhost/'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_response' user supplied function #1 'http://localhost/'
+Passed all 1 tests for 'test_response' user supplied function
?Testing 'test_url' user supplied function #1
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_url' user supplied function #1
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_url' user supplied function #1 'http://localhost/OEHowTo.pdf'
+Passed all 1 tests for 'test_url' user supplied function
! Found 3 links in http://localhost/index.htm

?Testing 'filter_content' user supplied function #1 'http://localhost/'
?Testing 'test_response' user supplied function #1
+Passed all 1 tests for 'test_response' user supplied function
?Testing 'filter_content' user supplied function #1
?Testing 'test_response' user supplied function #1
+Passed all 1 tests for 'test_response' user supplied function
?Testing 'filter_content' user supplied function #1
?Testing 'test_response' user supplied function #1
+Passed all 1 tests for 'test_response' user supplied function

Summary for: http://localhost/
Connection: Keep-Alive: 3  (1.5/sec)
               Skipped: 4  (2.0/sec)
           Unique URLs: 4  (2.0/sec)

Removing very common words...
no words removed.
Writing main index...
err: No unique words indexed!

I understand the TmpDir warning, which is not a problem, but I'm not sure how
to get Swish-e to actually build the index. This is my siteindexpl.config:

# Include our site-wide configuration settings:

IncludeConfigFile conf/settings.config

# Specify the program to run
IndexDir prog-bin/

# When running under the "prog" document source method you can
# pass a list of parameters to the program (specified with -i or IndexDir).

# If a parameter is passed to, it will use that as the
# file.

# As a special case, the word "default" followed by URL(s).
# In this case the spider will use default settings to spider the provided

# SwishProgParameters default
# SwishProgParameters default

# Note: the default used by is
# See prog-bin/ for examples
# that include filtering PDF and MS Word documents.

# (default /var/tmp)  The location of a writeable temp directory
# on your system.  The HTTP access method tells the Perl helper to place
# its files there.  The default is defined in src/config.h and depends on
# the current OS.

TmpDir C:/Inetpub/Indexes/Temp

# Tell swish how to parse the content
DefaultContents HTML
IndexContents HTML .htm .html
FileFilter .pdf filter-bin/pdf2html
IndexContents HTML .pdf

IndexComments no

# Just to make it interesting, let's modify the URL that gets indexed:
# replace => http:/localhost/

# ReplaceRules replace localhost

This is the file:

#--------------------- Global Config ----------------------------

#  @servers is a list of hashes -- so you can spider more than one site
#  in one run (or different parts of the same tree)
#  The main program expects to use this array (@SwishSpiderConfig::servers).

  ### Please do not spider these examples -- spider your own servers, with permission ####

@servers = (

    # This is a more advanced example that uses more features,
    # such as ignoring some file extensions, and only indexing
    # some content-types, plus filters PDF and MS Word docs.
    # The call-back subroutines are explained a bit more below.
        skip        => 0,  # skip spidering this server
        debug       => DEBUG_INFO,  # print some debugging info to STDERR

      #  debug       => DEBUG_URL,  # print some debugging info to STDERR

      #  base_url        => '',
        base_url        => 'http://localhost/',
      #  base_url        => '',
        email           => '',
        link_tags       => [qw/ a frame /],
        delay_sec       => 30,        # Delay in seconds between requests
        max_files       => 50,         
        max_indexed     => 20,        # Max number of files to send to swish for indexing

        max_size        => 1_000_000,  # limit to 1MB file size
        max_depth       => 10,         # spider only ten levels deep
        keep_alive      => 1,

        test_url        => sub { $_[0]->path !~ /\.(?:gif|jpeg)$/ },

        test_response   => sub {
            my $content_type = $_[2]->content_type;
            my $ok = grep { $_ eq $content_type } qw{ text/html text/plain
application/pdf };

            # This might be used if you only wanted to index PDF files, yet still spider.
            #$_[1]->{no_index} = $content_type ne 'application/pdf';

            return 1 if $ok;
            print STDERR "$_[0] wrong content type ( $content_type )\n";

        filter_content  => [ \&pdf],


I have not changed the global functions in from the
windows daily build download dated 9/17/2003. is in the c:/swish-e/filter-bin/ subdirectory and I've also
tried moving it to the c:/swish-e directory where I'm executing the command
to run swish-e and received the same results.

My module is located in C:/Swish-e/filters/swish/ subdirectory. I
have applied the following lines of code to the windows_fork subroutine:

sub windows_fork {
    my ( $self, @args ) = @_;

    require IPC::Open2;
    my ( $rdrfh, $wtrfh );

    # Added these three lines per instructions from Bill Moseley 7/29/2003
    my $path = join " ", @args;
    open FH, "$path|" or die $!;
    return \*FH;

    my @command = map { s/"/\\"/g; qq["$_"] }  @args;

I get the same results with or without these lines, even after moving the
module to the C:/Swish-e directory.
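
For reference, here is a minimal standalone sketch of what the three added
lines do: join the arguments into one command string, open it as a pipe, and
hand back a read handle on the command's output. The helper name run_and_read
and the test command (the running perl printing one line) are hypothetical
stand-ins, not part of swish-e, and the sketch assumes the path to perl
contains no spaces.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the three lines added to windows_fork:
# build one command string and open it as a read pipe.
sub run_and_read {
    my @args = @_;
    my $path = join " ", @args;    # same join as in windows_fork
    open my $fh, "$path|" or die "Cannot run '$path': $!";
    return $fh;
}

# Stand-in for the real filter program: the current perl printing one line.
# Assumes $^X (path to the running perl) contains no spaces.
my $fh = run_and_read( $^X, '-le', '"print 42"' );

# Slurp everything the command wrote to standard output.
my $output = do { local $/; <$fh> };
close $fh;

print $output;
```

If a sketch like this prints the command's output but the filter still returns
nothing, the pipe itself is probably fine, and the command string or the
location of the filter program is the more likely place to look.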

Where am I going wrong, or do I need to give you more information?


