
FW: Re: Filtering problems

From: Klingensmith, Rick <klingensmith(at)not-real.hr.msu.edu>
Date: Fri Sep 19 2003 - 19:05:33 GMT
OK, I've seen the light and am switching over to spider.pl. So far I've
gotten it to use SwishSpiderConfig.pl and point at my local host, where it
finds 4 URLs (which is correct). However, it is not indexing the output;
the problem is probably with the filter object. Here is the output from
swish-e using the following command line:

C:\SWISH-E>C:\Swish-E\swish-e -S prog -c C:\Swish-E\conf\siteindexpl.config

Warning: Configuration setting for TmpDir 'C:/Inetpub/Indexes/Temp' will be
overridden by environment setting 'C:\DOCUME~1\klingen2\LOCALS~1\Temp'
Indexing Data Source: "External-Program"
Indexing "prog-bin/spider.pl"
External Program found: ./prog-bin/spider.pl
C:\SWISH-E\prog-bin\spider.pl: Reading parameters from 'SwishSpiderConfig.pl'

 -- Starting to spider: http://localhost/ --
?Testing 'test_url' user supplied function #1 'http://localhost/'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_response' user supplied function #1 'http://localhost/'
+Passed all 1 tests for 'test_response' user supplied function
?Testing 'test_url' user supplied function #1 'http://localhost/affidavit.pdf'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_url' user supplied function #1 'http://localhost/TerminationChecklist.pdf'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_url' user supplied function #1 'http://localhost/OEHowTo.pdf'
+Passed all 1 tests for 'test_url' user supplied function
! Found 3 links in http://localhost/index.htm

?Testing 'filter_content' user supplied function #1 'http://localhost/'
?Testing 'test_response' user supplied function #1 'http://localhost/affidavit.pdf'
+Passed all 1 tests for 'test_response' user supplied function
?Testing 'filter_content' user supplied function #1 'http://localhost/affidavit.pdf'
?Testing 'test_response' user supplied function #1 'http://localhost/TerminationChecklist.pdf'
+Passed all 1 tests for 'test_response' user supplied function
?Testing 'filter_content' user supplied function #1 'http://localhost/TerminationChecklist.pdf'
?Testing 'test_response' user supplied function #1 'http://localhost/OEHowTo.pdf'
+Passed all 1 tests for 'test_response' user supplied function

Summary for: http://localhost/
Connection: Keep-Alive: 3  (1.5/sec)
               Skipped: 4  (2.0/sec)
           Unique URLs: 4  (2.0/sec)

Removing very common words...
no words removed.
Writing main index...
err: No unique words indexed!
.
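
For debugging, I believe spider.pl can also be run standalone so I can see
exactly what it hands to swish-e (these commands are my reading of the
spider.pl docs, so the "-i stdin" form may be off):

C:\SWISH-E>perl prog-bin\spider.pl SwishSpiderConfig.pl > spider.out
C:\SWISH-E>swish-e -S prog -i stdin < spider.out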

I understand the TmpDir warning; that's no problem. What I'm not sure about
is how to get Swish-e to actually build the index. This is my
siteindexpl.config file:

# Include our site-wide configuration settings:

IncludeConfigFile conf/settings.config

# Specify the program to run
IndexDir prog-bin/spider.pl


# When running under the "prog" document source method you can
# pass a list of parameters to the program (specified with -i or IndexDir).

# If a parameter is passed to spider.pl, it will use that as the
# configuration file.

# As a special case, the word "default" may be passed, followed by URL(s).
# In that case the spider will use default settings to spider the provided
# URLs.

# SwishProgParameters default http://35.8.31.67
# SwishProgParameters default http://www.hr.msu.edu/hrsite

# Note: the default used by spider.pl is SwishSpiderConfig.pl.
# See prog-bin/SwishSpiderConfig.pl for examples
# that include filtering PDF and MS Word documents.

# (default /var/tmp)  The location of a writeable temp directory
# on your system.  The HTTP access method tells the Perl helper to place
# its files there.  The default is defined in src/config.h and depends on
# the current OS.

TmpDir C:/Inetpub/Indexes/Temp

# Tell swish how to parse the content
DefaultContents HTML
IndexContents HTML .htm .html
FileFilter .pdf filter-bin/pdf2html
IndexContents HTML .pdf

IndexComments no

# Just to make it interesting, let's modify the URL that gets indexed:
# replace http://swish-e.org/ => http://localhost/

# ReplaceRules replace swish-e.org localhost

This is the SwishSpiderConfig.pl file:

#--------------------- Global Config ----------------------------

#  @servers is a list of hashes -- so you can spider more than one site
#  in one run (or different parts of the same tree)
#  The main program expects to use this array (@SwishSpiderConfig::servers).

  ### Please do not spider these examples -- spider your own servers, with permission ####

@servers = (

 
#=============================================================================
    # This is a more advanced example that uses more features,
    # such as ignoring some file extensions, and only indexing
    # some content-types, plus filters PDF and MS Word docs.
    # The call-back subroutines are explained a bit more below.
    {
        skip        => 0,  # skip spidering this server
        debug       => DEBUG_INFO,  # print some debugging info to STDERR

      #  debug       => DEBUG_URL,  # print some debugging info to STDERR


      #  base_url        => 'http://www.swish-e.org/',
        base_url        => 'http://localhost/',
      #  base_url        => 'http://www.hr.msu.edu/hrsite/',
        email           => 'webmaster@hr.msu.edu',
        link_tags       => [qw/ a frame /],
        delay_sec       => 30,        # Delay in seconds between requests
        max_files       => 50,         
        max_indexed     => 20,        # Max number of files to send to swish for indexing

        max_size        => 1_000_000,  # limit to 1MB file size
        max_depth       => 10,         # spider only ten levels deep
        keep_alive      => 1,

        test_url        => sub { $_[0]->path !~ /\.(?:gif|jpeg)$/ },

        test_response   => sub {
            my $content_type = $_[2]->content_type;
            my $ok = grep { $_ eq $content_type } qw{ text/html text/plain application/pdf };

            # This might be used if you only wanted to index PDF files,
            # yet still spider the rest of the site for links.
            #$_[1]->{no_index} = $content_type ne 'application/pdf';

            return 1 if $ok;
            print STDERR "$_[0] wrong content type ( $content_type )\n";
            return;
        },

        filter_content  => [ \&pdf],
    },

);

I have not changed the global functions in SwishSpiderConfig.pl from the
Windows daily build download dated 9/17/2003.
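
For reference, the global pdf() function in that file looks roughly like
this (reproduced from memory of the distribution, so treat the details as
approximate):

sub pdf {
    my ( $uri, $server, $response, $content_ref ) = @_;

    # Only transform PDF responses; pass everything else through
    # untouched so HTML pages still get indexed as-is.
    return 1 if $response->content_type ne 'application/pdf';

    # pdf2html comes with the spider's filter helpers; it converts the
    # raw PDF to HTML (setting <title> from the PDF info dictionary) so
    # swish-e can parse it with its HTML parser.
    $$content_ref = ${ pdf2html( $content_ref, 'title' ) };

    return 1;
}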

swish_filter.pl is in the C:/swish-e/filter-bin/ subdirectory. I've also
tried moving it to the C:/swish-e directory where I'm executing swish-e,
and received the same results.

My Filter.pm module is located in the C:/Swish-e/filters/swish/
subdirectory. I have added the following lines of code to the windows_fork
subroutine:

sub windows_fork {
    my ( $self, @args ) = @_;

    require IPC::Open2;
    my ( $rdrfh, $wtrfh );

    # Added these three lines per instructions from Bill Moseley 7/29/2003.
    # Note the early return: everything below it is now unreachable.
    my $path = join " ", @args;
    open FH, "$path|" or die $!;
    return \*FH;

    my @command = map { s/"/\\"/g; qq["$_"] }  @args;

I get the same results with or without these lines, even after moving the
module to the C:/Swish-e directory.
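
If it helps, a small standalone script along these lines should show
whether the piped open works at all on this machine (pdftotext and the
file path are just examples, not what the spider actually runs):

use strict;
use warnings;

# Mimic the open("command|") approach added to windows_fork above:
# run an external filter and slurp everything it writes to stdout.
my @args = ( 'pdftotext', 'C:/Inetpub/wwwroot/affidavit.pdf', '-' );
my $path = join ' ', @args;

open my $fh, "$path|" or die "Cannot run '$path': $!";
my $text = do { local $/; <$fh> };    # slurp all output at once
close $fh;

print 'Got ', length($text), " bytes from the filter\n";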

Where am I going wrong, or do I need to give you more information?

Rick

> -----Original Message-----
> From: Bill Moseley [mailto:moseley@hank.org]
> Sent: Thursday, September 18, 2003 7:06 PM
> To: Multiple recipients of list
> Subject: [SWISH-E] Re: Filtering problems
> 
> On Thu, Sep 18, 2003 at 03:23:31PM -0700, Klingensmith, Rick wrote:
> 
> > I've caused myself some more problems with filtering PDF documents, I
> > believe. I've installed the latest windows install exe on my test
> > server and modified windows_fork in Filter.pm. This was to get around
> > a memory issue that started, which we couldn't solve. Now I'm getting
> > the following error message when swish-e tries to index a PDF:
> >
> > retrieving http://35.8.31.67/affidavit.pdf (1)...
> >
> > Can't locate object method "convert" via package "SWISH::Filter" at
> > C:/Swish-E/swishspider line 149.
> 
> I already responded to Rick by email, but for the list (and archive):
> 
> SWISH::Filter was updated.  Previously, to filter a document you called
> 
>    $filtered = $filter->filter(...)
> 
> which returned true or false.  But that's not a very Object Oriented
> interface so I added a new method:
> 
>    $doc = $filter->convert(...)
> 
> which returns an object "$doc".
> 
> The programs swishspider and spider.pl were updated to use that new
> interface.
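> 
> Something like this, for example (the argument names are from the
> SWISH::Filter docs as I remember them -- check your installed version):
> 
>     my $doc = $filter->convert(
>         document     => \$content,          # raw document in a scalar ref
>         content_type => 'application/pdf',
>         name         => $url,               # used in messages
>     );
>     if ( $doc ) {
>         my $filtered_ref = $doc->fetch_doc; # ref to converted content
>     }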
> 
> Rick's problem (so I assume) is that he's using a new version of
> swishspider, but an old version of SWISH::Filter.  I assume that
> happened because he's got a "use lib" line in swishspider pointing to
> an old version of SWISH::Filter.
> 
> But swishspider is an exception in that it doesn't automatically point
> to where SWISH::Filter is installed.  In other words, swishspider
> doesn't use SWISH::Filter by default because (unlike spider.pl)
> swishspider runs for each document spidered.  That would mean loading
> SWISH::Filter (and all the associated filter modules) over and over.
> 
> The better solution is to use spider.pl instead of swishspider.
> 
> Much of the work in getting 2.4.0 released is getting Windows to install
> (and use) things in their right place.  So perhaps that was the problem.
> 
> Why doesn't Microsoft follow Apple's lead and replace their OS with BSD?
> 
> --
> Bill Moseley
> moseley@hank.org