Wow, thanks a lot! It all makes sense now! The
lightbulb finally turned on, lol.
I edited Doc2txt.pm like you showed, and now I'm
trying to write a Ppt2txt.pm. There isn't a binary
that converts ppt to txt, only one that converts to
HTML (ppthtml). The only problem is that the <TITLE>
is the full path and filename, which, with SWISH-E,
ends up looking like /tmp/sddwt4g490 or whatever.
I know I can pipe the output through w3m with some
options to strip the HTML tags and make it text, but
I'm having a hard time figuring out how to make it
work in a module. Using Doc2txt.pm as an example,
I tried about 20 different things I was hoping would
work, but no luck.
How would I change the line...
my $content = $filter->run_program( $self->{ppthtml}, $file );
to do the bash equivalent of...
ppthtml [filegoeshere] | w3m -dump -T text/html | perl -pe 's/\xa0/ /g'
?
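
For reference, a minimal sketch of one way to get the effect of that pipeline from plain Perl. Everything named here (shell_quote, run_pipeline, clean_text, ppt_to_text) is a hypothetical helper, not part of SWISH::Filter. It assumes ppthtml and w3m are on the PATH and that backticks run through a /bin/sh-style shell, and it does not show how the result would be hooked into the filter in place of the run_program call.

```perl
use strict;
use warnings;

# Quote one argument so /bin/sh treats it as a single literal word.
sub shell_quote {
    my ($arg) = @_;
    $arg =~ s/'/'\\''/g;
    return "'$arg'";
}

# Run a list of stages (each an array ref of program + arguments) as
# "stage1 | stage2 | ..." and return the output of the last stage.
# Returns undef if the last stage exits non-zero.
sub run_pipeline {
    my @stages = @_;
    my $cmd = join ' | ',
        map { join ' ', map { shell_quote($_) } @$_ } @stages;
    my $out = `$cmd`;
    return if $? != 0 || !defined $out;
    return $out;
}

# Same job as: perl -pe 's/\xa0/ /g'
sub clean_text {
    my ($text) = @_;
    $text =~ s/\xa0/ /g;
    return $text;
}

# Hypothetical wrapper: ppthtml FILE | w3m -dump -T text/html, then cleanup.
sub ppt_to_text {
    my ($file) = @_;
    my $text = run_pipeline(
        [ 'ppthtml', $file ],
        [ 'w3m', '-dump', '-T', 'text/html' ],
    );
    return unless defined $text;
    return clean_text($text);
}
```

Doing the \xa0 substitution in Perl itself avoids starting a third process, and quoting each argument keeps odd temp-file names from being interpreted by the shell.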
Unless someone is hoarding a Ppt2txt.pm, I'd love some
help :)
I greatly appreciate all the help thus far! If I can
get this working, we'll go live with it!
--- Bill Moseley <moseley@hank.org> wrote:
> On Wed, Jul 07, 2004 at 11:33:36AM -0700, Alan Ivey wrote:
> > @servers = (
> > # Localhost
> > {
> > skip => 0,
> >
> > base_url => 'http://localhost',
> > same_hosts => [ qw/127.0.0.1/ ],
> > agent => 'swish-e spider http://swish-e.org/',
> > email => 'alan@localhost',
> >
> > delay_sec => 2,
>
> Turn on Keep Alives and don't use a delay.
>
>
> > max_time => 10,
> > max_files => 100,
> > max_indexed => 20,
> > keep_alive => 1,
> > filter_content => \&filter_content,
> > },
> > );
> >
> > I've read the docs several times, and searched on the
> > mailing list, and I'm just not getting it. But like
> > I've said before on this list, I'm fairly new to
> > Linux, and I don't really know much of anything in the
> > way of Perl. So, my question is... do I just have to
> > put modules in the
> > /usr/local/lib/swish-e/perl/SWISH/Filters directory,
> > and then they'll automatically be processed? Don't I
> > have to set the content type somewhere? Where that
> > goes doesn't jump out at me, a newbie, in the sample
> > file.
>
> Well, first read
> http://swish-e.org/current/docs/Filter.html
> that should give some overview. Then just pick an
> existing filter and copy it as your new filter.
>
> You can put the filters anyplace, they just need to
> be in the SWISH::Filters name space. It's not as
> complex as it sounds -- SWISH::Filter
> (SWISH/Filter.pm) takes Perl's @INC array and appends
> "SWISH/Filters" to each path to make a full path to a
> directory. It then looks in that directory for
> filters.
>
> So, you can make a directory called
> $HOME/SWISH/Filters and add a module called
> PowerPoint.pm to it (the module is
> SWISH::Filters::PowerPoint) and then set
> PERL5LIB=$HOME and SWISH::Filter will find the module.
>
> That make any sense? SWISH::Filter uses @INC to
> find the filters.
>
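
The lookup described above can be sketched roughly as follows. This is an illustration only, not SWISH::Filter's actual code: filter_dirs and filter_modules are hypothetical names, and the sketch simply appends SWISH/Filters to each search path and lists the .pm files it finds there.

```perl
use strict;
use warnings;
use File::Spec;

# Return the SWISH/Filters directories that actually exist under the
# given search paths (pass @INC to mimic the lookup described above).
sub filter_dirs {
    my @search_paths = @_;
    return grep { -d $_ }
           map  { File::Spec->catdir( $_, 'SWISH', 'Filters' ) }
           grep { !ref $_ } @search_paths;    # @INC may hold code refs
}

# Return the full path of every *.pm file in those directories.
sub filter_modules {
    my @found;
    for my $dir ( filter_dirs(@_) ) {
        opendir( my $dh, $dir ) or next;
        my @names = grep { /\.pm\z/ } readdir $dh;
        closedir $dh;
        push @found, map { File::Spec->catfile( $dir, $_ ) } sort @names;
    }
    return @found;
}
```

Calling filter_modules(@INC) after setting PERL5LIB=$HOME should then list $HOME/SWISH/Filters/PowerPoint.pm if the file exists, since PERL5LIB entries are added to @INC.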
> > I wish I knew more Perl :( Tis frustrating.
>
> Me too.
>
> > I ran swish-filter-test and it seems there needs to be
> > more than just an existing module. The first time I
> > ran it, it said I needed MIME::Type and MIME::Types so
> > I added those to a suitable Perl folder. Here's the
> > results of my .doc test, even with Doc2txt.pm being in
> > the SWISH Filter folder...
>
> MIME::Type shouldn't be required -- it's just used
> if available to map from file extensions to
> content-types. There are a few built-in maps if
> MIME::Types isn't installed. But PowerPoint is not
> in there by default.
>
>
> > >> Loading filter: [SWISH/Filters/Doc2txt.pm]
> > Find path of [catdoc] in /usr/local/bin:/usr/bin:/bin:/usr/local/lib/swish-e
> > * Found program at: [/usr/local/bin/catdoc]
>
> Ok, so that filter found "catdoc" so it's available.
>
>
> >
> > >> Starting to process new document: application/x-msword
>
> And your document (from MIME::Types, I guess) is
> marked as x-msword.
>
> > ++Checking filter [SWISH::Filters::Doc2txt=HASH(0x8ff3b3c)] for application/x-msword
> > ++ application/x-msword was not filtered by SWISH::Filters::Doc2txt=HASH(0x8ff3b3c)
>
> For some reason Doc2txt didn't accept the file for
> filtering. What SWISH::Filter does is pass the
> document to all filters, one-by-one until it's
> accepted by a filter. It's up to the filter to
> determine if it can filter the document -- normally
> by checking the content type.
>
> It MAY be that Doc2txt doesn't know about that
> content type. I think at one point it only checked
> for application/msword and then MIME::Types was
> updated for x-msword. But I'm not sure. Just look at
> Doc2txt.pm and see what it does.
>
> moseley@bumby:~/swish-e/filters/SWISH/Filters$ fgrep msword Doc2txt.pm
>     return unless $filter->content_type =~ m!application/(x-)?msword!;
>
> So the filter is just returning if the content type
> doesn't match.
>
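
The "try each filter until one accepts" behaviour described above, with each filter declining by returning early, can be sketched like this. The filter list and first_accepting_filter are hypothetical stand-ins, not SWISH::Filter's own code. The application/vnd.ms-powerpoint check is an assumption about what content type a PowerPoint file would be given; only the msword pattern comes from Doc2txt.pm.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for filter modules. Each one looks at the
# content type and either returns its name (accepts the document) or
# returns nothing (declines), like the "return unless" line above.
my @filters = (
    sub {
        my ($type) = @_;
        return unless $type =~ m!application/(x-)?msword!;
        return 'Doc2txt';
    },
    sub {
        my ($type) = @_;
        # Assumed content type for .ppt files.
        return unless $type =~ m!application/vnd\.ms-powerpoint!;
        return 'Ppt2txt';
    },
);

# Offer the content type to each filter in turn; stop at the first
# one that accepts it. Returns undef if every filter declines.
sub first_accepting_filter {
    my ($type) = @_;
    for my $filter (@filters) {
        my $name = $filter->($type);
        return $name if defined $name;
    }
    return;
}
```

With this shape, a new Ppt2txt filter only needs its own content-type check; a document that no filter recognises just falls through unfiltered.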
>
>
> --
> Bill Moseley
> moseley@hank.org
>
Received on Thu Jul 8 06:52:49 2004