Made a filter for powerpoint (ppt), included. Have questions.

From: Randy <randyest(at)not-real.gmail.com>
Date: Wed Feb 02 2005 - 16:36:28 GMT
My recent htdig -> swish-e conversion was relatively quick and easy,
and I'm very happy with the results.  Thanks to everyone who
contributed to swish-e.

One thing I was missing was a ppt filter; I saw a lot of requests for
such a filter in the archive, but no working code.  It wasn't hard to
make a basic working one, so here it is (just put it in your Filters
directory, and make sure the ppthtml executable is in your path):

---- start ppt2html.pm ----
package SWISH::Filters::ppt2html;
use strict;
use vars qw/ $VERSION /;
$VERSION = '0.01';
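sub new {
   my ( $class ) = @_;
   my $self = bless {
       # MIME type this filter handles (dot escaped so it matches literally)
       mimetypes   => [ qr!application/vnd\.ms-powerpoint! ],
   }, $class;
   # Register ppthtml; this is what provides the run_ppthtml method below
   return $self->set_programs( 'ppthtml' );
}

sub filter {
   my ( $self, $doc ) = @_;
   # Run ppthtml on the document's file; give up if nothing comes back
   my $content = $self->run_ppthtml( $doc->fetch_filename ) || return;
   # update the document's content type
   $doc->set_content_type( 'text/html' );
   return \$content;
}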

1;
__END__

=head1 NAME

SWISH::Filters::ppt2html - Perl extension for filtering MS PowerPoint
documents with Swish-e

=head1 DESCRIPTION

This is a plug-in module that uses the ppthtml program to convert MS
PowerPoint documents to html for indexing by Swish-e.

This filter plug-in requires the xlhtml package, which includes ppthtml,
available at:

   http://chicago.sourceforge.net/xlhtml

Currently produces document titles like /tmp/foo1234.  Need to alter
to pass actual document title.


=head1 AUTHOR

Randy Thomas

=head1 SEE ALSO

L<SWISH::Filter>
---- end ppt2html.pm ----


I have not yet figured out how to pass a more useful title back to
Filter.pm.  The code above generates doc titles like "/tmp/foo1234"
where I'd like to have the actual name of the .ppt file instead.  I'm
still reading all the docs, so I'm sure I'll get to the answer
eventually, but if anyone wants to give me a hint I won't mind :)
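
One possible workaround, as a rough sketch rather than the way Filter.pm
intends it: rewrite the <title> in the HTML that ppthtml produces before
returning it.  The helper name retitle_html is made up for this example,
and it only uses plain Perl, so it can be tried outside swish-e:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SWISH::Filter): replace the <title> of
# an HTML string, or add one after <head> if the page has no title.
sub retitle_html {
    my ( $html, $title ) = @_;

    # Escape characters that are special in HTML text ($title is a copy,
    # so the caller's value is not modified).
    for ($title) {
        s/&/&amp;/g;
        s/</&lt;/g;
        s/>/&gt;/g;
    }

    # Replace the first existing title; otherwise insert one after <head>.
    unless ( $html =~ s{<title>.*?</title>}{<title>$title</title>}is ) {
        $html =~ s{(<head[^>]*>)}{$1<title>$title</title>}i;
    }
    return $html;
}

# Example: swap the temp-file title for a readable one.
my $page = '<html><head><title>/tmp/foo1234</title></head>'
         . '<body>slide text</body></html>';
print retitle_html( $page, 'budget.ppt' ), "\n";
```

Inside filter() this would be called on $content just before the
return.  Where the real name comes from is the part I am assuming: if
the document object exposes the original path or URL (a name method, if
I am reading the docs right), then File::Basename's basename() of that
value would be a reasonable title.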

Another small item I miss from my htdig setup is automatic indexing
inside .zip, .Z, .gz, .tar archives.  I'm not really sure how to chain
the filters so that, after unzipping an archive, the ppt, doc, xls,
html, txt, etc. files inside will be passed to the appropriate filter.
Does this recursion happen automatically, or do I have to specify it
in my config?

Would it be possible to use FileFilter directives (even though I'm
using prog / spider.pl )?  Something like:

FileFilter .gz gzip "-c '%p'"
FileFilter .zip unzip "-p '%p'"
etc. for all compression/archive types?

Will the files inside each archive be passed along to the next
appropriate filter?  How about (unfortunate cases) where there's a .gz
or .tar file inside a .zip file?  I'd like to dig as deep as possible.
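
To make the "dig as deep as possible" idea concrete, here is a small
standalone sketch of what I mean, for the gzip case only.  It is not a
swish-e filter and says nothing about what swish-e does on its own; the
function name peel_gzip is made up, and it relies only on the
IO::Uncompress::Gunzip module that ships with recent Perls:

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical helper: strip nested .gz layers from a document.
# Takes a file name and a reference to its bytes; returns the inner
# name and a reference to the uncompressed bytes.
sub peel_gzip {
    my ( $name, $data_ref ) = @_;

    # Each pass removes one trailing ".gz" and one layer of compression.
    while ( $name =~ s/\.gz\z//i ) {
        my $out;
        gunzip( $data_ref => \$out )
            or die "gunzip failed for $name: $GunzipError";
        $data_ref = \$out;
    }
    return ( $name, $data_ref );
}
```

After peeling, the inner name (say, slides.ppt) is what would decide
which filter handles the content next.  Archives holding several files
(.zip, .tar) would need a loop over their members on top of this, which
the sketch does not attempt.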

Any hints or tips will be appreciated; thanks again for the great tool.

Randy
Received on Wed Feb 2 08:36:28 2005