On Thu, Jul 08, 2004 at 06:51:23AM -0700, Alan Ivey wrote:
>
> I edited Doc2txt.pm like you showed, and now I'm
> trying to write a Ppt2txt.pm. There isn't a binary
> that converts ppt to txt, but rather html (ala
> ppthtml). The only problem is, the <TITLE/> is the
> full filename and path, which, with SWISH-E, makes it
> like /tmp/sddwt4g490 or whatever.
Does ppthtml extract a valuable title from the document?
> I know I can pipe the output through w3m with some
> options to strip the HTML tags to make it text, but
> I'm having a hard time figuring out how to make it
> work in a module. Using the doc2txt.pm as an example,
> I tried about 20 different things I was hoping would
> work but no luck.
Do you even need to strip the HTML? Just let swish-e do it with its
html parser.
>
> How would I change the line...
> my $content = $filter->run_program( $self->{ppthtml},
> $file )
>
> To do the bash equivilent of...
> ppthtml [filegoeshere] | w3m -dump -T text/html | perl
> -pe 's/\xa0/ /g'
> ?
If you really need to do that then there are a few ways. First, the
filter could flag it as a new content type and also say the filtering
is not complete, and then use a secondary filter to strip the HTML.
Another way would be to write the content (or the output of ppthtml)
to a file and then use another run_program() line to process it again.
Or you can just use a shell call, either backticks or system().
(I didn't try these):
$content = `ppthtml $file | w3m -dump -T text/html | perl -pe 's/\xa0/ /g'`;
or
system("ppthtml $file | w3m -dump -T text/html | perl -pe 's/\xa0/ /g' > outfile");
and then read the file back in.
I would likely not do either of those -- I try to avoid the shell for
security reasons.
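
For what it's worth, here is a rough, untested sketch of the same
pipeline done without the shell. It uses Perl's list-form pipe open, so
the filename is never seen by a shell, and writes the HTML to a temp
file so w3m can read it. The names capture() and ppt_to_text() are made
up for illustration and are not part of SWISH::Filter. It assumes
ppthtml and w3m are on the PATH and that w3m accepts a filename after
its options.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: run a command with its arguments passed as a
# list and return everything it writes to stdout. With more than one
# element in @cmd, Perl execs the program directly, with no shell.
sub capture {
    my @cmd = @_;
    open( my $fh, '-|', @cmd ) or die "cannot run $cmd[0]: $!";
    local $/;    # slurp mode
    my $out = <$fh>;
    close $fh or die "$cmd[0] failed: " . ( $! || "exit status $?" );
    return defined $out ? $out : '';
}

# Hypothetical wrapper: ppt -> html -> text, then replace \xa0 bytes
# with plain spaces, as in the original one-liner.
sub ppt_to_text {
    my ($file) = @_;

    my $html = capture( 'ppthtml', $file );

    # w3m reads the HTML from a temp file instead of a pipe.
    my ( $tmp_fh, $tmp_name ) = tempfile( SUFFIX => '.html', UNLINK => 1 );
    print {$tmp_fh} $html;
    close $tmp_fh or die "cannot write $tmp_name: $!";

    my $text = capture( 'w3m', '-dump', '-T', 'text/html', $tmp_name );
    $text =~ s/\xa0/ /g;
    return $text;
}
```

Inside the filter module you would call something like
ppt_to_text($file) in place of the run_program() line. The temp file
costs an extra write, but it avoids having to feed w3m's stdin and read
its stdout at the same time.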
--
Bill Moseley
moseley@hank.org
Received on Thu Jul 8 08:20:32 2004