Re: Indexing .doc .ppt .xls with filters and prog method

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Aug 19 2005 - 14:07:19 GMT
On Fri, Aug 19, 2005 at 04:53:27AM -0700, Benoit Guguin wrote:
> Ok thank you,
> 
> I have tested with DirTree.pl and it works fine with xls, pdf and doc.
> 
> So I'm now looking to add filters for PowerPoint and OpenOffice
> (sxi, sxw, sxc), but I don't understand the source code :( ...
> 
> If someone has already done this, could they share the file, please?

You can just copy one of the existing filters in
$srcdir/filters/SWISH/Filters/ and adapt it.  I see there's already a
pp2html.pm filter that requires the ppthtml program:

perldoc pp2html.pm
pp2html(3)            User Contributed Perl Documentation           pp2html(3)



NAME
       SWISH::Filters::pp2html - Perl extension for filtering MS PowerPoint docu-
       ments with Swish-e

DESCRIPTION
       This is a plug-in module that uses the xlhtml package to convert MS Power-
       Point documents to html for indexing by Swish-e.

       This filter plug-in requires the xlhtml package which includes ppthtml
       available at:

          http://chicago.sourceforge.net/xlhtml

       Currently produces document titles like /tmp/foo1234.  Need to alter to
       pass actual document title.

AUTHOR
       Randy Thomas

SEE ALSO
       SWISH::Filter
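
The title problem mentioned in those docs could probably be handled along
these lines.  This is only a sketch: set_html_title is a hypothetical
helper, not part of pp2html.pm or SWISH::Filter, and it assumes the calling
code still knows the real document name at the point where it has the
converted HTML in hand.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of pp2html.pm): replace the <title> of the
# converter's HTML output, or add one if the output has none.
sub set_html_title {
    my ( $html, $title ) = @_;

    # Escape the characters that are special in HTML text.
    for ($title) { s/&/&amp;/g; s/</&lt;/g; s/>/&gt;/g; }

    # Replace an existing <title>...</title>, if there is one.
    return $html
        if $html =~ s{<title>.*?</title>}{<title>$title</title>}is;

    # Otherwise add a title right after <head>, when a <head> exists.
    $html =~ s{(<head[^>]*>)}{$1<title>$title</title>}i;
    return $html;
}

# Made-up sample input with the kind of temp-file title described above.
my $html = '<html><head><title>/tmp/foo1234</title></head>'
         . '<body>Slide 1</body></html>';

print set_html_title( $html, 'Q3 report.ppt' ), "\n";
```

A regex is enough here because only one element is touched; anything more
involved would be better done with a real HTML parser.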


Check the archives -- I thought someone posted initial work on an
OpenOffice filter.



> 
> 
> Thanks again,
> 
> Regards,
> 
> Peter Karman wrote:
> 
> >The .pm files:
> >
> >  doc2txt.pm
> >  pdf2html.pm
> >  pdf2xml.pm
> >
> >are example modules that predate (iirc) the SWISH::Filters class. The reason 
> >pdf2html works in your script is this line in the pdf2html.pm file:
> >
> >   @EXPORT = qw(pdf2html);
> >
> >which tells Perl to make that function available in your script's namespace with 
> >the 'use' function.
> >
> >I'd suggest using the DirTree.pl example script instead; it calls SWISH::Filter 
> >for you correctly.
> >
> >Benoit Guguin scribbled on 8/19/05 4:45 AM:
> >
> >  
> >
> >>Hello,
> >>
> >>I'm trying to index a directory containing only pdf, doc, xls and ppt files.
> >>
> >>
> >>In version 2.5.4 I've seen some Perl scripts to filter .ppt, .xls and .doc.
> >>
> >>I tried to use them with the prog method, but when I run swish-e
> >>("swish-e -c /etc/swish-e/swish.conf -S prog") I get these errors:
> >>
> >>Undefined subroutine &main::Doc2html called at /etc/swish-e/swish.pl 
> >>line 55.
> >>Or
> >>Undefined subroutine &main::pp2hml called at /etc/swish-e/swish.pl
> >>
> >>Which error I get depends on the order of the functions.
> >>
> >>
> >>So I don't understand why it works fine for pdf but not for the other
> >>formats...
> >>
> >>I've looked through the mailing list archive but haven't found my Holy Grail ;)
> >>
> >>Any ideas, please?
> >>
> >>Regards,
> >>
> >>
> >>    
> >>
> >
> >  
> >
> 
> 
> -- 
> Guguin Benoit
> Société Alixen 2 rue Jean Rostand 91 893 Orsay Cedex France
> Tel : 01 69 85 24 13, Fax : 01 69 85 24 10
> 
> 

-- 
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs
   swish-e@sunsite.berkeley.edu
Received on Fri Aug 19 07:07:25 2005