OK, I've seen the light and am switching over to spider.pl. So far I've
gotten it to read SwishSpiderConfig.pl and spider my local host, where it
finds 4 URLs (which is correct). However, nothing is being indexed, and I
suspect the problem is with the filter object. Here is the output from
swish-e using the following command line:
C:\SWISH-E>C:\Swish-E\swish-e -S prog -c C:\Swish-E\conf\siteindexpl.config
Warning: Configuration setting for TmpDir 'C:/Inetpub/Indexes/Temp' will be overridden by environment setting 'C:\DOCUME~1\klingen2\LOCALS~1\Temp'
Indexing Data Source: "External-Program"
Indexing "prog-bin/spider.pl"
External Program found: ./prog-bin/spider.pl
C:\SWISH-E\prog-bin\spider.pl: Reading parameters from 'SwishSpiderConfig.pl'
-- Starting to spider: http://localhost/ --
?Testing 'test_url' user supplied function #1 'http://localhost/'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_response' user supplied function #1 'http://localhost/'
+Passed all 1 tests for 'test_response' user supplied function
?Testing 'test_url' user supplied function #1 'http://localhost/affidavit.pdf'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_url' user supplied function #1 'http://localhost/TerminationChecklist.pdf'
+Passed all 1 tests for 'test_url' user supplied function
?Testing 'test_url' user supplied function #1 'http://localhost/OEHowTo.pdf'
+Passed all 1 tests for 'test_url' user supplied function
! Found 3 links in http://localhost/index.htm
?Testing 'filter_content' user supplied function #1 'http://localhost/'
?Testing 'test_response' user supplied function #1 'http://localhost/affidavit.pdf'
+Passed all 1 tests for 'test_response' user supplied function
?Testing 'filter_content' user supplied function #1 'http://localhost/affidavit.pdf'
?Testing 'test_response' user supplied function #1 'http://localhost/TerminationChecklist.pdf'
+Passed all 1 tests for 'test_response' user supplied function
?Testing 'filter_content' user supplied function #1 'http://localhost/TerminationChecklist.pdf'
?Testing 'test_response' user supplied function #1 'http://localhost/OEHowTo.pdf'
+Passed all 1 tests for 'test_response' user supplied function
Summary for: http://localhost/
Connection: Keep-Alive: 3 (1.5/sec)
Skipped: 4 (2.0/sec)
Unique URLs: 4 (2.0/sec)
Removing very common words...
no words removed.
Writing main index...
err: No unique words indexed!
.
I understand the TmpDir warning, and it is not a problem. What I can't
figure out is how to get Swish-e to actually build the index. This is my
siteindexpl.config file:
# Include our site-wide configuration settings:
IncludeConfigFile conf/settings.config
# Specify the program to run
IndexDir prog-bin/spider.pl
# When running under the "prog" document source method you can
# pass a list of parameters to the program (specified with -i or IndexDir).
# If a parameter is passed to spider.pl, it will use that as the
# configuration file.
# As a special case, pass the word "default" followed by URL(s).
# In this case the spider will use default settings to spider the
# provided URLs.
# SwishProgParameters default http://35.8.31.67
# SwishProgParameters default http://www.hr.msu.edu/hrsite
# Note: the default used by spider.pl is SwishSpiderConfig.pl.
# See prog-bin/SwishSpiderConfig.pl for examples
# that include filtering PDF and MS Word documents.
# (default /var/tmp) The location of a writeable temp directory
# on your system. The HTTP access method tells the Perl helper to place
# its files there. The default is defined in src/config.h and depends on
# the current OS.
TmpDir C:/Inetpub/Indexes/Temp
# Tell swish how to parse the content
DefaultContents HTML
IndexContents HTML .htm .html
FileFilter .pdf filter-bin/pdf2html
IndexContents HTML .pdf
IndexComments no
# Just to make it interesting, let's modify the URL that gets indexed:
# replace http://swish-e.org/ => http://localhost/
# ReplaceRules replace swish-e.org localhost
This is the SwishSpiderConfig.pl file:
#--------------------- Global Config ----------------------------
# @servers is a list of hashes -- so you can spider more than one site
# in one run (or different parts of the same tree)
# The main program expects to use this array (@SwishSpiderConfig::servers).
### Please do not spider these examples -- spider your own servers, with permission ####
@servers = (
#=============================================================================
# This is a more advanced example that uses more features,
# such as ignoring some file extensions, and only indexing
# some content-types, plus filters PDF and MS Word docs.
# The call-back subroutines are explained a bit more below.
    {
        skip        => 0,           # skip spidering this server
        debug       => DEBUG_INFO,  # print some debugging info to STDERR
        # debug     => DEBUG_URL,   # print some debugging info to STDERR
        # base_url  => 'http://www.swish-e.org/',
        base_url    => 'http://localhost/',
        # base_url  => 'http://www.hr.msu.edu/hrsite/',
        email       => 'webmaster@hr.msu.edu',
        link_tags   => [qw/ a frame /],
        delay_sec   => 30,          # Delay in seconds between requests
        max_files   => 50,
        max_indexed => 20,          # Max number of files to send to swish for indexing
        max_size    => 1_000_000,   # limit to 1MB file size
        max_depth   => 10,          # spider only ten levels deep
        keep_alive  => 1,
        test_url    => sub { $_[0]->path !~ /\.(?:gif|jpeg)$/ },
        test_response => sub {
            my $content_type = $_[2]->content_type;
            my $ok = grep { $_ eq $content_type } qw{ text/html text/plain application/pdf };
            # This might be used if you only wanted to index PDF files, yet still spider.
            #$_[1]->{no_index} = $content_type ne 'application/pdf';
            return 1 if $ok;
            print STDERR "$_[0] wrong content type ( $content_type )\n";
            return;
        },
        filter_content => [ \&pdf ],
    },
);
I have not changed the global functions in SwishSpiderConfig.pl from the
windows daily build download dated 9/17/2003.
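In the log above, every 'filter_content' test line appears without a matching "+Passed" line, and the summary reports "Skipped: 4". One way to check whether that callback is the step that drops the documents might be to swap in a pass-through callback temporarily. The following is only a sketch under assumptions: log_only_filter and Demo::Response are hypothetical names, and I am assuming that filter_content receives the same first three arguments as test_response plus a reference to the content, and that a true return value means "keep the document".

```perl
use strict;
use warnings;

# Stand-in for the response object so the sketch runs without the spider.
# (Hypothetical; the real callback would receive the spider's own object.)
{
    package Demo::Response;
    sub new          { my ( $class, $type ) = @_; return bless { type => $type }, $class }
    sub content_type { return $_[0]{type} }
}

# Hypothetical pass-through callback for 'filter_content'.
# Assumed argument order: $uri, $server, $response, $content_ref
sub log_only_filter {
    my ( $uri, $server, $response, $content_ref ) = @_;
    my $type = $response->content_type;
    my $size = defined($$content_ref) ? length($$content_ref) : 0;
    print STDERR "filter_content: $uri type=$type bytes=$size\n";
    return 1;    # always true, so this callback never rejects a document
}

# Demonstration with made-up content.
my $body = 'sample bytes';
my $kept = log_only_filter( 'http://localhost/affidavit.pdf', {},
    Demo::Response->new('application/pdf'), \$body );
print "kept=$kept\n";
```

With only the sub copied into SwishSpiderConfig.pl, the setting would temporarily read filter_content => [ \&log_only_filter ]. The PDFs would then go to swish unconverted, but if the HTML page gets indexed, that would point at the pdf callback rather than at the spider itself.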
Swish_filter.pl is in the c:/swish-e/filter-bin/ subdirectory. I've also
tried moving it to the c:/swish-e directory, which is where I run swish-e
from, and got the same results.
My filter.pm module is located in the C:/Swish-e/filters/swish/ subdirectory.
I have added the following lines of code to the windows_fork subroutine:
sub windows_fork {
    my ( $self, @args ) = @_;
    require IPC::Open2;
    my ( $rdrfh, $wtrfh );
    # Added these three lines per instructions from Bill Moseley 7/29/2003
    my $path = join " ", @args;
    open FH, "$path|" or die $!;
    return \*FH;
    my @command = map { s/"/\\"/g; qq["$_"] } @args;
I get the same results with or without those lines, even after moving the
module to the C:/Swish-e directory.
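For reference, the three added lines open a pipe from the joined command line and return the read handle. A minimal sketch of the same idea is shown here, using a lexical filehandle and quoting any argument that contains whitespace (the unquoted join would split a path with spaces in it). run_and_read is a hypothetical helper and is not part of SWISH::Filter. The quoting is deliberately simple and does not escape double quotes embedded in an argument.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command and return a handle for reading its
# output. Arguments containing whitespace are wrapped in double quotes
# before the command line is handed to the shell.
sub run_and_read {
    my @args = @_;
    my $cmd  = join ' ', map { /\s/ ? qq["$_"] : $_ } @args;
    open( my $fh, '-|', $cmd ) or die "Cannot run '$cmd': $!";
    return $fh;
}

# Demonstration: run the current perl binary and read what it prints.
my $fh  = run_and_read( $^X, '-e', 'print 42' );
my $out = do { local $/; <$fh> };    # slurp everything the child wrote
close $fh;
print "read: $out\n";
```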
Where am I going wrong or do I need to give you more information?
Rick
> -----Original Message-----
> From: Bill Moseley [mailto:moseley@hank.org]
> Sent: Thursday, September 18, 2003 7:06 PM
> To: Multiple recipients of list
> Subject: [SWISH-E] Re: Filtering problems
>
> On Thu, Sep 18, 2003 at 03:23:31PM -0700, Klingensmith, Rick wrote:
>
> > I've caused myself some more problems with filtering PDF documents I
> > believe. I've installed the latest windows install exe on my test server
> > and modified windows fork in filter.pm. This was to get around a memory
> > issue that started, which we couldn't solve. Now I'm getting the following
> > error message when swish-e tries to index a pdf:
> >
> > retrieving http://35.8.31.67/affidavit.pdf (1)...
> >
> > Can't locate object method "convert" via package "SWISH::Filter" at
> > C:/Swish-E/swishspider line 149.
>
> I already responded to Rick by email, but for the list (and archive):
>
> SWISH::Filter was updated. Previously, you filtered a document with
>
> $filtered = $filter->filter(...)
>
> which returned true or false. But that's not a very Object Oriented
> interface so I added a new method:
>
> $doc = $filter->convert(...)
>
> which returns an object "$doc".
>
> The programs swishspider and spider.pl were updated to use that new
> interface.
>
> Rick's problem (so I assume) is that he's using a new version of
> swishspider, but an old version of SWISH::Filter. I assume that
> happened because he's got a "use lib" line in swishspider pointing to
> an old version of SWISH::Filter.
>
> But swishspider is an exception in that it doesn't automatically point
> to where SWISH::Filter is installed. In other words, swishspider
> doesn't use SWISH::Filter by default because (unlike spider.pl)
> swishspider runs for each document spidered. That would mean loading
> SWISH::Filter (and all the associated filter modules) over and over.
>
> The better solution is to use spider.pl instead of swishspider.
>
> Much of the work in getting 2.4.0 released is getting Windows to install
> (and use) things in their right place. So perhaps that was the problem.
>
> Why doesn't Microsoft follow Apple's lead and replace their OS with BSD?
>
> --
> Bill Moseley
> moseley@hank.org
Received on Fri Sep 19 19:06:00 2003