Re: PowerPoint module for spider.pl

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Jul 08 2004 - 15:20:18 GMT
On Thu, Jul 08, 2004 at 06:51:23AM -0700, Alan Ivey wrote:
> 
> I edited Doc2txt.pm like you showed, and now I'm
> trying to write a Ppt2txt.pm. There isn't a binary
> that converts ppt to txt, but rather html (ala
> ppthtml). The only problem is, the <TITLE/> is the
> full filename and path, which, with SWISH-E, makes it
> like /tmp/sddwt4g490 or whatever.

Does ppthtml extract a valuable title from the document?


> I know I can pipe the output through w3m with some
> options to strip the HTML tags to make it text, but
> I'm having a hard time figuring out how to make it
> work in a module. Using the doc2txt.pm as an example,
> I tried about 20 different things I was hoping would
> work but no luck. 

Do you even need to strip the HTML?  You could just let swish-e do it
with its HTML parser.


> 
> How would I change the line...
> my $content = $filter->run_program( $self->{ppthtml}, $file )
> 
> To do the bash equivalent of...
> ppthtml [filegoeshere] | w3m -dump -T text/html | perl -pe 's/\xa0/ /g'
> ?

If you really need to do that then there are a few ways.  First, the
filter could flag the output as a new content type, say that filtering
is not complete, and then let a secondary filter strip the HTML.

Another way would be to write the content (or the output of ppthtml)
to a file and then use another run_program() line to process it again.

Or you can just use a shell call, either backticks or system().

(I didn't try these):

    # "\\" keeps the \xa0 escape literal so the inner perl sees it
    $content = `ppthtml $file | w3m -dump -T text/html | perl -pe 's/\\xa0/ /g'`;
or
    system("ppthtml $file | w3m -dump -T text/html | perl -pe 's/\\xa0/ /g' > outfile");

and then read the file back in.

I would likely not do either of those -- I try to avoid the shell for
security reasons.
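
For comparison, here is a rough, untested-against-ppthtml sketch of the
same pipeline done without a shell. It uses list-form open(), so $file is
passed to the program as a single argument and never parsed as shell
syntax. The names capture() and ppt_to_text() and the to_html/to_text
options are made up for illustration and are not part of SWISH::Filter.
It also assumes that ppthtml writes its HTML to stdout and that w3m will
read the HTML from a file named on its command line.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Run a command without a shell and return everything it prints.
# List-form open() hands each argument straight to exec(), so spaces
# or metacharacters in a file name are not interpreted.
sub capture {
    my @cmd = @_;
    open( my $out, '-|', @cmd ) or die "Cannot run $cmd[0]: $!";
    binmode $out;
    local $/;                       # slurp mode
    my $text = <$out>;
    close $out or die "$cmd[0] failed (wait status $?)";
    return $text;
}

# Hypothetical helper: ppt -> HTML -> text, with no shell involved.
# The default commands are assumptions taken from the pipeline above.
sub ppt_to_text {
    my ( $file, %opt ) = @_;
    my $to_html = $opt{to_html} || ['ppthtml'];
    my $to_text = $opt{to_text} || [ 'w3m', '-dump', '-T', 'text/html' ];

    my $html = capture( @$to_html, $file );

    # Hand the HTML to the second program through a temporary file.
    my ( $fh, $tmp ) = tempfile( SUFFIX => '.html', UNLINK => 1 );
    binmode $fh;
    print {$fh} $html;
    close $fh or die "Cannot write $tmp: $!";

    my $text = capture( @$to_text, $tmp );
    $text =~ s/\xa0/ /g;            # same job as the trailing perl -pe
    return $text;
}
```

Inside a filter module the call would then be something like
my $content = ppt_to_text( $file ); in place of the run_program() line.
The non-breaking-space substitution is done in Perl itself, so the third
process in the pipeline is not needed.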

-- 
Bill Moseley
moseley@hank.org

Received on Thu Jul 8 08:20:32 2004