See below for answers to your questions. At this point I'm beginning to wonder
whether I should abandon this approach and try -S prog. The only problem is that
I was having fun trying to figure out a configuration file to crawl my site, and
I thought -S prog used filters too and would have the same problem.
> -----Original Message-----
> From: firstname.lastname@example.org [mailto:email@example.com]
> Sent: Tuesday, July 29, 2003 10:48 AM
> To: Multiple recipients of list
> Subject: [SWISH-E] Re: FW: Re: More Trouble with Filters
> On Tue, Jul 29, 2003 at 06:18:34AM -0700, Klingensmith, Rick wrote:
> > I'm probably beginning to sound like a flake, but I've got myself very
> > confused at this point. I've used the following config file and added a
> > use lib line to the swishspider file:
> > SpiderDirectory C:/Swish-E
> > # Use the file filter to index pdf files
> > #FileFilter .pdf c:/SWISH-E/filter-bin/_pdf2html.pl '"%p" -'
> > #FileFilter .pdf c:/SWISH-E/filter-bin/pdftotext.exe '"%p" -'
> > # Filter Directory
> > FilterDir C:/SWISH-E/filter-bin
> Hi Rick,
> I haven't looked at the filter code in a while. FilterDir is prepended
> to the program specified with FileFilter when the program doesn't start
> with a "/". That doesn't work very well on Windows (or anywhere
> really). I'd suspect in your case it's trying to run a program called:
> (although you have those filters commented out)
> > Swishspider is in my SWISH-e directory. With this configuration the pdf
> > files indexed correctly, but I'm still getting the same output on the
> > tags as below in my previous post.
> You mean malformed meta tags like you posted? I guess I need to see if
> I can get my windows machine to boot and try a few things. Can you make
> available online a test PDF file? You can send it directly to me if you
> don't want it public -- although it would be helpful for me to fetch it
> from the same URL you are using.
> So, provide me with
> - a URL to fetch a test PDF file
22.214.171.124/affidavit.pdf should get you to the pdf in question. There is
another pdf named terminationchecklist.pdf in the directory too which causes
the same problems. I do have a firewall running on my pc so if you have
problems let me know your ip and I'll allow it in.
> - the version of swish-e you are using (perhaps a link to the specific
> version you installed)
When I run swish-e -h it reports version 2.2.3, and I have been running with
DEBUG_FILTER set to 1. I downloaded it about 3 weeks ago from the
swish-e-2.2.3-win32exe link and assumed it was the latest version.
> - answer if your swishspider has the lines shown below
My swishspider does contain the lines below. Here are the top lines from my
swishspider:
print STDERR "spider $$ [@ARGV]\n";
# SWISH-E http method Spider
# $Id: swishspider,v 1.9 2002/09/09 07:15:19 whmoseley Exp $
# Should SWISH::Filter be use for filtering? This can be left 1 all the time, but
# will add a little time to processing since.
use constant USE_FILTERS => 1; # 1 = yes use SWISH::Filter for filtering, 0 = no. (faster processing if not set)
use constant FILTER_TEXT => 0; # set to one to filter text/* content, 0 will save processing time
use constant DEBUG_FILTER => 1; # set to one to report errors on loading SWISH::Filter module.
use HTML::Parser 3.00;
> And I'll try it on my Windows machine (if it will boot).
> > I thought I was using the SWISH::Filter by default, but now I'm not sure.
> > When I use the FileFilter directive in the config file I get errors that the
> > pdf is invalid. Once I commented both lines out at least it indexed the pdf
> > without error. The FilterDir directive doesn't seem to matter; I get the same
> > output with or without it. I did confirm that the document is being indexed
> > with a search for words that only appear in the pdf with the correct
> > results.
> > My perl/site/lib/swish subdirectory contains filter.pm and
> > perl/site/lib/swish/filters contain the other filter modules. I'm sure
> > this is a simple configuration issue, but my perl knowledge is limited and
> > debugging has been a problem.
> I'm not exactly clear what version you are running. Does your
> swishspider have this at the top?:
> # Should SWISH::Filter be use for filtering? This can be left 1 all the time, but
> # will add a little time to processing since.
> use constant USE_FILTERS => 1; # 1 = yes use SWISH::Filter for filtering, 0 = no. (faster processing if not set)
> use constant FILTER_TEXT => 0; # set to one to filter text/* content, 0 will save processing time
> use constant DEBUG_FILTER => 0; # set to one to report errors on loading SWISH::Filter module.
> Many things regarding installation have changed for the 2.4.0 version --
> namely most things get installed in places so that you don't need to
> specify paths and set Perl libraries locations. That will make things
> much easier in the future.
> If your swishspider has those lines above then it's designed to work
> with the SWISH::Filter modules. By default (and this is something to
> possibly change), swishspider doesn't know where SWISH::Filter is
> installed. That's on purpose because I didn't want swishspider using
> those filters by default. Why? Because the way swish-e works with
> -S http is that it calls swishspider for every URL fetched and that's
> slow (due to the compiling of the swishspider Perl script). Making it
> load all the SWISH::Filter modules would be a lot more work for every
> request. Using -S prog (and spider.pl) avoids all that.
> So, to have swishspider use SWISH::Filter you either have to set a
> PERL5LIB environment variable or add a "use lib" line to the top of
> swishspider. Both do the same by adding paths to Perl's @INC array so
> Perl can find the modules.
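
As a rough sketch of the PERL5LIB route, the Python helper below builds the
command line and environment for one standalone swishspider run. The function
names and the lib_dir argument are hypothetical: lib_dir stands for whatever
directory holds SWISH/Filter.pm on your machine. Only the PERL5LIB variable and
the "perl swishspider <prefix> <url>" command shape come from this thread.

```python
import os
import subprocess


def build_spider_run(prefix, url, lib_dir, script="swishspider"):
    """Return (command, env) for one standalone swishspider run.

    lib_dir is a hypothetical directory assumed to contain SWISH/Filter.pm.
    """
    env = dict(os.environ)
    existing = env.get("PERL5LIB")
    # Perl adds every PERL5LIB entry to @INC; put lib_dir first so it wins.
    env["PERL5LIB"] = lib_dir if not existing else lib_dir + os.pathsep + existing
    command = ["perl", script, prefix, url]
    return command, env


def run_spider(prefix, url, lib_dir, script="swishspider"):
    """Run swishspider once and return perl's exit status.

    Requires perl on the PATH and the script in the current directory.
    """
    command, env = build_spider_run(prefix, url, lib_dir, script)
    return subprocess.run(command, env=env).returncode
```

Setting the variable only for the child process keeps the rest of the session
untouched, which is handy while experimenting; a "use lib" line in swishspider
itself has the same effect on @INC.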
> So if you have the above lines in swishspider then you can set
> DEBUG_FILTER => 1 and it will tell you if swishspider was able to load
> the SWISH::Filter module (and SWISH::Filters::* filter modules).
> Then, what you want to do is run swishspider without running swish-e:
> perl swishspider prefix http://localhost/test.pdf
> Then you should have in the current directory files called
> prefix.contents and prefix.response (contains the HTTP response code),
> and maybe a prefix.links (if the file is HTML and has links to follow).
> That will tell you if the SWISH::Filter module is being used (well,
> really it will tell you if it's not being used if DEBUG_FILTER is set).
> Then you can look at prefix.contents and see the output that is being
> created. If you then see the messed up meta tags then it's a problem
> with the way the SWISH::Filter is working under Windows.
> prefix.response should have text/html in it if the file was filtered,
> otherwise it will have application/pdf.
I ran swishspider this way and the contents file contained the page with the
malformed meta tags and the response file contained the text/html line. The
only thing I could see wrong with the files is the meta tags.
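
For anyone repeating this check, here is a minimal Python sketch of the
inspection described above. It assumes swishspider has already been run with
the prefix "prefix", so prefix.response and prefix.contents exist in the given
directory. The function names are hypothetical. The test is only a substring
match on the content types named in this thread, and the meta-tag listing just
pulls out candidate lines to look at by eye.

```python
from pathlib import Path


def filter_status(prefix, directory="."):
    """Classify a swishspider run from its <prefix>.response file."""
    response = Path(directory, prefix + ".response").read_text(errors="replace")
    if "text/html" in response:
        return "filtered"      # SWISH::Filter converted the document
    if "application/pdf" in response:
        return "unfiltered"    # the PDF was passed through as-is
    return "unknown"


def meta_tag_lines(prefix, directory="."):
    """Return the lines of <prefix>.contents that mention a meta tag."""
    text = Path(directory, prefix + ".contents").read_text(errors="replace")
    return [line for line in text.splitlines() if "<meta" in line.lower()]


if __name__ == "__main__":
    if Path("prefix.response").exists() and Path("prefix.contents").exists():
        print(filter_status("prefix"))
        for line in meta_tag_lines("prefix"):
            print(line)
```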
> Bill Moseley
Received on Tue Jul 29 16:22:54 2003