On Fri, Sep 19, 2003 at 12:05:00PM -0700, Klingensmith, Rick wrote:
> OK, I've seen the light and am switching over to use spider.pl. So far I've
> gotten it to use SwishSpiderConfig.pl and point to my local host to find 4
> URLs (which is correct). However, it is not indexing the output and it's
> probably issues with the filter object? Here is the output from swish-e
> using the following command line:
Hi Rick,
Sure you don't want to install Linux? I just received over
300 "Internet Upgrade" messages "from" Microsoft today, right on the
heels of Sobig, so it seems like everyone would be ready to switch...
> Summary for: http://localhost/
> Connection: Keep-Alive: 3 (1.5/sec)
> Skipped: 4 (2.0/sec)
> Unique URLs: 4 (2.0/sec)
All docs were skipped, probably because SWISH::Filter was not loaded.
> debug => DEBUG_INFO, # print some debugging info to STDERR
Changing that to
    debug => DEBUG_SKIPPED|DEBUG_INFO
will report why each document was skipped. I'm sure it was because they
were not filtered correctly.
May I suggest you start out simple without a spider config file?
Dave sent me a Windows PR3 build the other day and we still had problems
with it working right because of PATH issues. I was able to get it
working (and indexing PDF files) with a few tweaks.
See, under unix all the paths get set correctly when running the
configure script, so swish-e and spider.pl know where to find things.
Under Windows we have to figure out where things are installed at run
time -- and try and keep the Windows version in sync with the way the
Unix version works.
Now, to summarize, when using -S prog swish-e uses the popen() system
call to run the external program. Under unix you can create a swish-e
config file "swish.conf" that contains just:
SwishProgParameters default http://localhost/index.html
and then run swish-e like:
swish-e -S prog -c swish.conf -i spider.pl
Swish-e will then look in the PATH and in $libexecdir (where spider.pl
was installed) for the program specified with -i (same thing as IndexDir
in a config file). Once that program is found swish-e calls popen with
the command:
/path/to/spider.pl default http://localhost/index.html
and reads input from spider.pl's stdout.
That "default" says to use spider.pl's default settings, which is a good
way to start. It will filter by default -- *if* it can find all the
filter parts that are needed.
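
To make the lookup described above concrete, here is a rough sketch in Python (not swish-e's actual C code). The function names find_program and build_command are made up for illustration, and the search order (PATH directories first, then $libexecdir) is an assumption; the description above only says that both places are searched.

```python
import os

def find_program(name, path_dirs, libexecdir):
    """Return the first existing file called `name`.

    Assumption: the PATH directories are tried first, then libexecdir.
    """
    for directory in list(path_dirs) + [libexecdir]:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None

def build_command(program, prog_parameters):
    """Append the SwishProgParameters words to the located program."""
    return " ".join([program] + prog_parameters.split())

# The command string that would be handed to popen():
print(build_command("/path/to/spider.pl",
                    "default http://localhost/index.html"))
```

If find_program returns None, nothing can be run at all, which is the first thing to rule out on Windows.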
Now, I only have Windows 98, so I can't run a program "spider.pl"
directly. I guess you can with Win2K and WinNT, though. So what I did
was create a spider.bat batch file to run spider.pl for me.
spider.bat:
perl /path/to/spider.pl %1 %2 %3 %4 %5 %6 %7
I put that in the same directory where spider.pl was installed
(lib/swish-e/spider.pl).
Note:
The other way to do that is to run perl as the program:
SwishProgParameters /path/to/spider.pl default http://localhost/index.html
and then run swish-e like:
swish-e -S prog -c swish.conf -i perl.exe
But I think I like the batch file method better.
Once spider.pl is running it has to find the SWISH::Filter module. Perl
searches the directories in the @INC array, which can be extended with a
"use lib" line (see the top of spider.pl) or by setting the PERL5LIB
environment variable. When swish-e is installed on Windows that path is
supposed to be set correctly at the top of spider.pl. But if it isn't
you can set PERL5LIB. I think under Windows you can do:
set PERL5LIB=c:\<where you installed swish>\lib\swish-e
BTW -- That will change soon to match the unix install layout:
<install dir>\lib\swish-e\perl
so you will have to look and see where the SWISH directory is located
and then point to the directory above that (because Perl appends the
path \SWISH\Filter.pm when looking for the module).
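
The "directory above" rule is easier to see written out. This is only a Python imitation of how Perl turns a module name into a file path and tries each search directory in turn; module_to_relpath and locate_module are hypothetical names, not part of Perl or swish-e.

```python
import os

def module_to_relpath(module_name):
    """Turn 'SWISH::Filter' into the relative path SWISH/Filter.pm."""
    return os.path.join(*module_name.split("::")) + ".pm"

def locate_module(module_name, inc_dirs):
    """Return the first <dir>/SWISH/Filter.pm that exists, or None."""
    relpath = module_to_relpath(module_name)
    for directory in inc_dirs:
        candidate = os.path.join(directory, relpath)
        if os.path.isfile(candidate):
            return candidate
    return None
```

Pointing the search at the SWISH directory itself fails, because the lookup would then be for SWISH\SWISH\Filter.pm. The directory to list is the one that contains SWISH.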
Then to complicate things more, SWISH::Filter then has to locate the
program pdftotext to convert PDF files to text (HTML, really).
Those helper programs are installed in the $PATH under unix so that's
not a problem, but on Windows they are installed in $libexecdir. In
your version SWISH::Filter may not know to look in $libexecdir. That
has been fixed in CVS, but until there's a new Windows version my
suggestion is to see where the Windows installer put those programs
(pdftotext, catdoc) and add that location to your PATH environment.
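
A quick way to check whether the helper programs are reachable is to search a given PATH string for them. The sketch below uses Python's standard shutil.which; missing_helpers and with_extra_dir are names invented for this example, and the directory you add would be wherever your installer actually put the programs.

```python
import os
import shutil

def with_extra_dir(search_path, extra_dir):
    """Return search_path with extra_dir put in front of it."""
    if not search_path:
        return extra_dir
    return extra_dir + os.pathsep + search_path

def missing_helpers(helpers, search_path):
    """List the helper programs that cannot be found on search_path."""
    return [name for name in helpers
            if shutil.which(name, path=search_path) is None]

# Check the PATH of the current environment:
print(missing_helpers(["pdftotext", "catdoc"], os.environ.get("PATH", "")))
```

Anything still reported as missing after adding the install directory means the filter for that file type will not run.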
So, in summary, to get filtering to work (with SWISH::Filter) you need
to:
1) make sure Windows can run spider.pl
   use a spider.bat file if needed
2) make sure spider.pl can locate SWISH::Filter
   set PERL5LIB or edit spider.pl's "use lib" line
3) make sure SWISH::Filter can locate the conversion program
   add the location of the programs to your PATH
Now, how to debug things:
Set the environment variable for the spider debugging:
set SPIDER_DEBUG=url,skipped
(see spider.pl docs for details)
Then for debugging SWISH::Filter use:
set FILTER_DEBUG=1
I have added additional debugging lately that will show how
SWISH::Filter is searching for filter programs ("pdftotext", for
example).
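
In case it helps to see how a setting like "url,skipped" relates to the DEBUG_SKIPPED|DEBUG_INFO style of flag, here is a small Python sketch of turning a comma-separated list into a bit mask. This is not spider.pl's code: the numeric values in DEBUG_FLAGS are placeholders picked for the example, and only the three names used in this message are included.

```python
# Placeholder bit values; spider.pl defines its own constants.
DEBUG_FLAGS = {
    "url": 2,
    "skipped": 16,
    "info": 32,
}

def parse_debug(value):
    """Combine a comma-separated list of debug names into one bit mask."""
    mask = 0
    for word in value.split(","):
        word = word.strip().lower()
        if not word:
            continue
        if word not in DEBUG_FLAGS:
            raise ValueError("unknown debug option: " + word)
        mask |= DEBUG_FLAGS[word]
    return mask

print(parse_debug("url,skipped"))
```

Each name switches on one bit, so combining options is just OR-ing them together, the same as writing the flags with | in a config file.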
So, once you get that working then add extra swish-e config settings as
needed. If you want more control over spidering then switch to using a
config file for spider.pl (following the examples in
SwishSpiderConfig.pl). But wait until you get the above working
correctly. No point in making things too complicated from the start.
You can look in spider.pl and search for "sub default_urls" to see the
config spider.pl uses when you specify "default".
Now:
> I have applied the following lines of code to the windows_fork
> subroutine:
>
> sub windows_fork {
> my ( $self, @args ) = @_;
>
>
> require IPC::Open2;
> my ( $rdrfh, $wtrfh );
>
> # Added these three lines per instructions from Bill Moseley 7/29/2003
> my $path = join " ", @args;
> open FH, "$path|" or die $!;
> return \*FH;
I can't remember (or see) why that was needed. Maybe it was the binmode
issue. The current code in SWISH::Filter looks like this:
sub windows_fork {
    my ( $self, @args ) = @_;

    require IPC::Open2;
    my ( $rdrfh, $wtrfh );

    # Escape embedded double quotes, then wrap each argument in quotes
    my @command = map { s/"/\\"/g; qq["$_"] } @args;

    my $pid = IPC::Open2::open2($rdrfh, $wtrfh, @command );

    # IPC::Open3 uses binmode for some reason (5.6.1)
    # Assume that the output from the program will be in text
    # Maybe an invalid assumption if running through a binary filter
    binmode $rdrfh, ':crlf'; # perhaps: unless delete $self->{binary_output};

    $self->{pid} = $pid;
    return $rdrfh;
}
That, AFAICT, runs the program without going through the shell (even on
Windows). open2() calls system() with a 1 as the first parameter to
accomplish this. The painful part is that Windows seems to process the
double-quotes even when not going through the shell, so that's the
reason for the line:
my @command = map { s/"/\\"/g; qq["$_"] } @args;
which escapes any embedded double quotes and wraps each argument in
double quotes, so that phrase searches still work.
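
For anyone who doesn't read Perl, that map line does roughly what this Python function does. The name quote_args is made up here, and this is only a restatement of that one line, not a general Windows command-line quoting routine.

```python
def quote_args(args):
    """Backslash-escape each double quote, then wrap the argument in quotes."""
    return ['"' + arg.replace('"', '\\"') + '"' for arg in args]

for quoted in quote_args(["pdftotext", 'my "quoted" file.pdf']):
    print(quoted)
```

One difference worth noting: the Perl version modifies @args in place as a side effect of s///, while this sketch leaves its input list untouched.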
--
Bill Moseley
moseley@hank.org
Received on Sat Sep 20 01:02:10 2003