Re: indexing Msoft Word docs

From: David Larkin <david.larkin(at)not-real.djl.co.uk>
Date: Fri Dec 02 2005 - 14:06:41 GMT

On Fri, 2 Dec 2005 05:10:26 -0800 (PST)
Bill Moseley <moseley@hank.org> wrote:

> By the way:
> 
> On Fri, Dec 02, 2005 at 03:29:45AM -0800, David Larkin wrote:
> > 
> > #!/usr/bin/perl -w
> > use pdf2xml;
> > my @files =
> >     `find ./pdf/ -name '*.pdf' -print`;
> > for (@files) {
> >     chomp();
> >     my $xml_record_ref = pdf2xml($_);
> >     # this is one XML file with a SWISH-E header
> >     print $$xml_record_ref;
> > }
> > 
> > I've tried to build an equivalent for Word docs and came up with:
> > 
> > #!/usr/bin/perl -w
> > 
> > my @files =
> >     `find ./doc/ -name '*.doc' -print`;
> > for (@files) {
> >     chomp();
> >     my $xml_record_ref = exec "/usr/local/bin/catdoc $_";
> >     # this is one XML file with a SWISH-E header
> >     print $$xml_record_ref;
> > }
> > 
> > Warning: Unknown header line: 'Cricket Roundup' from program ./howto-doc-prog.pl
> > err: External program failed to return required headers Path-Name:
> 
> Note:
> 
> >     my $xml_record_ref = exec "/usr/local/bin/catdoc $_";
> >     print $$xml_record_ref;
> 
> You are just feeding the content to swish.  How will swish know where
> one doc ends and the next one starts?  Or each doc's file name?


Using catdoc was just a stab in the dark. I've now found doc2txt which appears to do what I'm after.

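#!/usr/bin/perl -w
use strict;
use doc2txt;

# Collect every Word document under ./doc/
my @files =
    `find ./doc/ -name '*.doc' -print`;
for (@files) {
    chomp;
    # doc2txt() returns a reference to the converted text
    my $txt_record_ref = doc2txt($_);
    print $$txt_record_ref;
}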

It is very similar to Josh's howto-pdf-prog.pl example.

Does your point regarding "How will swish know where one doc ends and the next one starts?" still hold for this solution?
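
For reference, the framing that question refers to can be sketched roughly as follows. In -S prog mode each document is preceded by a small header block: a Path-Name line, a Content-Length line giving the size of the content in bytes, and then a blank line. This is only a minimal sketch under assumptions: the helper names format_record and convert_doc are made up for illustration, the catdoc path is the one quoted above, and the Document-Type value TXT* assumes the plain-text parser is wanted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: frame one document for swish-e's -S prog input.
# The headers tell swish-e the file name and how many bytes follow.
sub format_record {
    my ($path, $content) = @_;
    return join '',
        "Path-Name: $path\n",
        'Content-Length: ' . length($content) . "\n",
        "Document-Type: TXT*\n",    # assumption: index as plain text
        "\n",                       # a blank line ends the header block
        $content;
}

# Hypothetical helper: run catdoc on one file and capture its output.
# Backticks or open() capture the output; exec would replace this
# process and never return.
sub convert_doc {
    my ($path) = @_;
    open my $fh, '-|', '/usr/local/bin/catdoc', $path or return;
    binmode $fh;                    # keep raw bytes so length() counts bytes
    local $/;                       # slurp the whole output at once
    my $text = <$fh>;
    close $fh;
    return $text;
}

# File names are taken from the command line in this sketch.
for my $path (@ARGV) {
    my $text = convert_doc($path);
    next unless defined $text && length $text;   # skip failed or empty conversions
    print format_record($path, $text);
}
```

Running such a script by hand on a single file (say ./doc/example.doc, a made-up name) and looking at the first few lines of output is a quick way to check whether the headers are there.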

It appears to work, and I can follow the logic, but I'm struggling to understand the documentation for 'SWISH::Filter'.
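
As I read the SWISH::Filter documentation, the basic pattern is: create one filter object, hand each file to convert(), and print whatever comes back behind the usual -S prog headers. The following is a rough sketch rather than a tested program. The helper swish_headers is made up for illustration, it assumes the converted content is a byte string, and which file types actually get converted depends on the filters and helper programs installed locally.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the header block for one -S prog record.
sub swish_headers {
    my ($path, $length, $parser) = @_;
    my $headers = "Path-Name: $path\nContent-Length: $length\n";
    # Document-Type is optional; only emit it when a parser type is known.
    $headers .= "Document-Type: $parser\n" if defined $parser;
    return "$headers\n";            # a blank line ends the header block
}

# Load SWISH::Filter only when there are files to process.
my $filter;
if (@ARGV) {
    require SWISH::Filter;
    $filter = SWISH::Filter->new;
}

for my $path (@ARGV) {
    # Given a path, convert() works out the content type from the file
    # extension and runs any matching filter. It returns false when there
    # is nothing to index.
    my $doc = $filter->convert( document => $path ) or next;
    next if $doc->is_binary;        # still not text after filtering

    my $content_ref = $doc->fetch_doc;
    # Assumption: the content is a byte string, so length() gives bytes.
    print swish_headers( $path, length($$content_ref), $doc->swish_parser_type ),
          $$content_ref;
}
```

With this approach the loop stays the same for every file type; supporting another format would be a matter of installing a suitable filter rather than writing a new script.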

The only things I'm looking for now are xls2txt and ppt2txt equivalents of doc2txt.

> 
> I suspect if you ran these from the command line you would see the
> difference.  The -S prog method needs headers to know what the file
> name is and how long it is.
> 
> Use SWISH::Filter.
> 
> 
> 
> -- 
> Bill Moseley
> moseley@hank.org
> 
Received on Fri Dec 2 06:06:42 2005