Re: PowerPoint module for spider.pl

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Jul 07 2004 - 19:52:57 GMT
On Wed, Jul 07, 2004 at 11:33:36AM -0700, Alan Ivey wrote:
> @servers = (
>     # Localhost
>     {
>         skip        => 0,
>         
>         base_url    => 'http://localhost',
>         same_hosts  => [ qw/127.0.0.1/ ],
>         agent       => 'swish-e spider http://swish-e.org/',
>         email       => 'alan@localhost',
> 
>         delay_sec   => 2,

Turn on Keep Alives and don't use a delay.


>         max_time    => 10,
>         max_files   => 100,
>         max_indexed => 20, 
>         keep_alive  => 1,  
>         filter_content  => \&filter_content,
>     },
> );    
> 
> I've read the Docs several times, and searched on the
> mailing list, and I'm just not getting it. But like
> I've said before on this list, I'm recently new to
> Linux, and I don't really know much of anything in the
> way of Perl. So, my question is... do I just have to
> put modules in the
> /usr/local/lib/swish-e/perl/SWISH/Filters directory,
> and then they'll automatically be processed? Don't I
> have to set the content type somewhere? Wherever they
> go doesn't jump out to me, a newbie in the sample
> file.

Well, first read http://swish-e.org/current/docs/Filter.html; that
should give some overview.  Then just pick an existing filter and
copy it as your new filter.

You can put the filters anyplace; they just need to be in the
SWISH::Filters name space.  It's not as complex as it sounds --
SWISH::Filter (SWISH/Filter.pm) takes Perl's @INC array and appends
"SWISH/Filters" to each path to make a full path to a directory.  It
then looks in that directory for filters.

So, you can make a directory called $HOME/SWISH/Filters, add a module
called PowerPoint.pm to it (the module is SWISH::Filters::PowerPoint),
and then set PERL5LIB=$HOME so that SWISH::Filter will find the module.

That make any sense?  SWISH::Filter uses @INC to find the filters.
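
To make the lookup concrete, here is a rough sketch of that @INC
search.  It is not the actual SWISH::Filter code; the sub names
filter_dirs and filter_modules are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: turn each @INC entry into a candidate
# "<path>/SWISH/Filters" directory and keep the ones that exist.
sub filter_dirs {
    my @inc = @_;
    return grep { -d $_ }
           map  { File::Spec->catdir( $_, 'SWISH', 'Filters' ) }
           grep { !ref $_ } @inc;    # @INC may hold code refs; skip them
}

# Hypothetical helper: list the *.pm files found in those directories.
sub filter_modules {
    my @found;
    for my $dir (@_) {
        opendir( my $dh, $dir ) or next;
        my @pm = grep { /\.pm\z/ } readdir $dh;
        closedir $dh;
        push @found, map { File::Spec->catfile( $dir, $_ ) } sort @pm;
    }
    return @found;
}

# Show every filter module visible through the current @INC.
print "$_\n" for filter_modules( filter_dirs(@INC) );
```

Run it once as-is and once with PERL5LIB=$HOME set in the
environment; with $HOME/SWISH/Filters/PowerPoint.pm in place, the
second run should list that file as well.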

> I wish I knew more Perl :( Tis frustrating.

Me too.

> I ran swish-filter-test and it seems there needs to be
> more than just an existing module. The first time I
> ran it, it said I needed MIME::Type and MIME::Types so
> I added those to a suitable Perl folder. Here's the
> results of my .doc test, even with Doc2txt.pm being in
> the SWISH Filter folder...

MIME::Types shouldn't be required -- it's just used, if available, to
map file extensions to content types.  There are a few built-in maps
if MIME::Types isn't installed.  But PowerPoint is not in there by
default.
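
Roughly, the extension lookup can be pictured like this.  It is only
a sketch: the %fallback table and the subs fallback_type and
content_type_for are invented for illustration (the real built-in
map may differ), and the ppt entry is one I added for this example.
The MIME::Types calls (new, mimeTypeOf, and type on the returned
object) are that module's own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed fallback table; not the table SWISH::Filter actually ships.
my %fallback = (
    doc  => 'application/msword',
    pdf  => 'application/pdf',
    html => 'text/html',
    ppt  => 'application/vnd.ms-powerpoint',   # added for this example
);

# Map a file name to a content type using only the fallback table.
sub fallback_type {
    my ($file) = @_;
    my ($ext) = $file =~ /\.([^.\/]+)\z/ or return;
    return $fallback{ lc $ext };
}

# Prefer MIME::Types when it is installed, else use the fallback table.
sub content_type_for {
    my ($file) = @_;
    if ( eval { require MIME::Types; 1 } ) {
        my $type = MIME::Types->new->mimeTypeOf($file);
        return $type->type if defined $type;
    }
    return fallback_type($file);
}

for my $file (qw( slides.ppt report.doc )) {
    my $type = content_type_for($file);
    printf "%s => %s\n", $file, defined $type ? $type : 'unknown';
}
```

What gets printed depends on whether MIME::Types is installed and on
its version, which is also why the same .doc file can show up as
application/msword on one machine and application/x-msword on
another.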


> >> Loading filter: [SWISH/Filters/Doc2txt.pm]
> Find path of [catdoc] in
> /usr/local/bin:/usr/bin:/bin:/usr/local/lib/swish-e
>  * Found program at: [/usr/local/bin/catdoc]

Ok, so that filter found "catdoc" so it's available.


>  
> >> Starting to process new document:
> application/x-msword

And your document (from MIME::Types, I guess) is marked as x-msword.

>  ++Checking filter
> [SWISH::Filters::Doc2txt=HASH(0x8ff3b3c)] for
> application/x-msword
>  ++ application/x-msword was not filtered by
> SWISH::Filters::Doc2txt=HASH(0x8ff3b3c)

For some reason Doc2txt didn't accept the file for filtering.
What SWISH::Filter does is pass the document to all filters,
one-by-one until it's accepted by a filter.  It's up to the filter to
determine if it can filter the document -- normally by checking the
content type.
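
As a toy model of that loop (stand-in filters only, not the real
SWISH::Filter classes or API; run_filters and the PdfStub entry are
made up for this example):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in filters: a name plus a code ref that returns the filtered
# text, or nothing at all when it declines the document.
my @filters = (
    {
        name => 'PdfStub',
        code => sub {
            my ($doc) = @_;
            return unless $doc->{content_type} eq 'application/pdf';
            return "text from $doc->{name}";
        },
    },
    {
        name => 'Doc2txt',
        code => sub {
            my ($doc) = @_;
            return unless $doc->{content_type} =~ m!application/(x-)?msword!;
            return "text from $doc->{name}";
        },
    },
);

# Offer the document to each filter in turn; the first one that
# returns something wins.  Returns an empty list if nobody accepts.
sub run_filters {
    my ( $doc, @chain ) = @_;
    for my $f (@chain) {
        my $out = $f->{code}->($doc);
        return ( $f->{name}, $out ) if defined $out;
    }
    return;
}

my $doc = { name => 'report.doc', content_type => 'application/x-msword' };
my ( $who, $text ) = run_filters( $doc, @filters );
print defined $who ? "$who accepted: $text\n" : "not filtered\n";
```

A document that no filter recognizes (say, a PowerPoint file with
only these two filters loaded) falls off the end of the loop, which
is what the "was not filtered by" lines in the log are reporting.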

It MAY be that Doc2txt doesn't know about that content type.  I think
at one point it only checked for application/msword and then
MIME::Types was updated for x-msword.  But I'm not sure.  Just look at
Doc2txt.pm and see what it does.

moseley@bumby:~/swish-e/filters/SWISH/Filters$ fgrep msword Doc2txt.pm 
    return unless $filter->content_type =~ m!application/(x-)?msword!;

So the filter is just returning if the content type doesn't match.
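
If you want to see what that pattern accepts, a quick check like the
one below will do.  The regular expression is copied from the fgrep
output above; the sub name accepts_msword is just something made up
for the test.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same pattern as the Doc2txt.pm line above, wrapped in a helper.
sub accepts_msword {
    my ($content_type) = @_;
    return $content_type =~ m!application/(x-)?msword! ? 1 : 0;
}

for my $type (
    qw( application/msword
        application/x-msword
        application/vnd.ms-powerpoint )
  )
{
    printf "%-32s %s\n", $type,
        accepts_msword($type) ? 'accepted' : 'declined';
}
```

This copy of the pattern accepts both msword spellings and declines
the PowerPoint type.  If the Doc2txt.pm you have installed only
checks for application/msword, that would be consistent with the log
you posted, so run the same fgrep on your copy.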



-- 
Bill Moseley
moseley@hank.org

Received on Wed Jul 7 12:53:08 2004