Re: indexing Msoft Word docs

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Dec 02 2005 - 13:10:31 GMT
By the way:

On Fri, Dec 02, 2005 at 03:29:45AM -0800, David Larkin wrote:
> 
> #!/usr/bin/perl -w
> use pdf2xml;
> my @files =
>     `find ./pdf/ -name '*.pdf' -print`;
> for (@files) {
>     chomp();
>     my $xml_record_ref = pdf2xml($_);
>     # this is one XML file with a SWISH-E header
>     print $$xml_record_ref;
> }
> 
> I've tried to build an equivalent for Word docs, and I came up with
> 
> #!/usr/bin/perl -w
> 
> my @files =
>     `find ./doc/ -name '*.doc' -print`;
> for (@files) {
>     chomp();
>     my $xml_record_ref = exec "/usr/local/bin/catdoc $_";
>     # this is one XML file with a SWISH-E header
>     print $$xml_record_ref;
> }
> 
> Warning: Unknown header line: 'Cricket Roundup' from program ./howto-doc-prog.pl
> err: External program failed to return required headers Path-Name:

Note:

>     my $xml_record_ref = exec "/usr/local/bin/catdoc $_";
>     print $$xml_record_ref;

You are just feeding the content to swish.  How will swish know where
one doc ends and the next one starts, or what each doc's file name is?

I suspect if you ran both scripts from the command line you would see the
difference.  The -S prog method needs headers (Path-Name and
Content-Length) to know what the file name is and how long the content is.
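
As a rough sketch of what the -S prog output needs to look like (untested
against your setup): each document goes out as a Path-Name header, a
Content-Length header, a blank line, and then exactly that many bytes of
content.  The sub names and the default ./doc directory are made up for
illustration; the catdoc path is the one from your script.  Note also that
Perl's exec replaces the running script and never returns, so the output
has to be captured with a piped open (or backticks) instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Build one record in the layout "-S prog" expects:
# headers, a blank line, then exactly Content-Length bytes of content.
# (format_record is a hypothetical helper name.)
sub format_record {
    my ($path, $content) = @_;
    # length() counts bytes here because the content is read as raw bytes
    return "Path-Name: $path\n"
         . "Content-Length: " . length($content) . "\n"
         . "\n"
         . $content;
}

# Run catdoc on one file and return its output (undef if it cannot start).
# The exit status of catdoc is not checked in this sketch.
sub extract_text {
    my ($path) = @_;
    # List-form piped open: no shell, so odd file names are safe
    open(my $fh, '-|', '/usr/local/bin/catdoc', $path) or return;
    binmode $fh;
    local $/;                 # slurp the whole output at once
    my $text = <$fh>;
    close $fh;
    return $text;
}

sub main {
    # Directories come from the command line; ./doc is an assumed default
    my @dirs = grep { -d } (@_ ? @_ : './doc');
    return unless @dirs;
    binmode STDOUT;
    find({
        no_chdir => 1,        # keep $_ as the full path
        wanted   => sub {
            return unless -f $_ && /\.doc$/i;
            my $text = extract_text($_);
            print format_record($_, $text) if defined $text && length $text;
        },
    }, @dirs);
}

main(@ARGV) unless caller;
```

Running it by hand first (./howto-doc-prog.pl | less) should show the
headers in front of each document before swish ever sees the output.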

Use SWISH::Filter.
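
A minimal sketch of the SWISH::Filter route, assuming the module is
installed along with swish-e and that one of its filters handles Word
docs (as far as I remember the Doc2txt filter calls catdoc).  The method
names (new, convert, is_binary, fetch_doc) are as I recall them from the
module's POD, so check perldoc SWISH::Filter before relying on them.
record_for and the directory arguments are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Turn one file into a "-S prog" record using a SWISH::Filter-style object.
# Returns undef when the file cannot be converted to text.
# (record_for is a hypothetical helper name.)
sub record_for {
    my ($filter, $path) = @_;
    my $doc = $filter->convert(document => $path) or return;
    return if $doc->is_binary;            # nothing turned it into text
    my $content_ref = $doc->fetch_doc;    # reference to the filtered text
    # Assumes the filtered text is a byte string, so length() is in bytes
    return "Path-Name: $path\n"
         . "Content-Length: " . length($$content_ref) . "\n\n"
         . $$content_ref;
}

sub main {
    my @dirs = grep { -d } @_;
    die "usage: $0 DIR...\n" unless @dirs;
    require SWISH::Filter;                # loaded late so record_for can be tried alone
    my $filter = SWISH::Filter->new;
    binmode STDOUT;
    find({
        no_chdir => 1,                    # keep $_ as the full path
        wanted   => sub {
            return unless -f $_ && /\.doc$/i;
            my $record = record_for($filter, $_);
            print $record if defined $record;
        },
    }, @dirs);
}

# Only runs when given at least one directory, e.g. ./howto-doc-prog.pl ./doc
main(@ARGV) if !caller && @ARGV;
```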



-- 
Bill Moseley
moseley@hank.org

Received on Fri Dec 2 05:10:32 2005