On Fri, 2 Dec 2005 05:10:26 -0800 (PST)
Bill Moseley <moseley@hank.org> wrote:
> By the way:
>
> On Fri, Dec 02, 2005 at 03:29:45AM -0800, David Larkin wrote:
> >
> > #!/usr/bin/perl -w
> > use pdf2xml;
> > my @files =
> > `find ./pdf/ -name '*.pdf' -print`;
> > for (@files) {
> > chomp();
> > my $xml_record_ref = pdf2xml($_);
> > # this is one XML file with a SWISH-E header
> > print $$xml_record_ref;
> > }
> >
> > I've tried to build an equivalent for Word docs. I came up with
> >
> > #!/usr/bin/perl -w
> >
> > my @files =
> > `find ./doc/ -name '*.doc' -print`;
> > for (@files) {
> > chomp();
> > my $xml_record_ref = exec "/usr/local/bin/catdoc $_";
> > # this is one XML file with a SWISH-E header
> > print $$xml_record_ref;
> > }
> >
> > Warning: Unknown header line: 'Cricket Roundup' from program ./howto-doc-prog.pl
> > err: External program failed to return required headers Path-Name:
>
> Note:
>
> > my $xml_record_ref = exec "/usr/local/bin/catdoc $_";
> > print $$xml_record_ref;
>
> You are just feeding the content to swish. How will swish know where
> one doc ends and the next one starts? Or each doc's file name?

Using catdoc was just a stab in the dark. I've now found doc2txt, which appears to do what I'm after:

#!/usr/bin/perl -w
use doc2txt;
my @files =
`find ./doc/ -name '*.doc' -print`;
for (@files) {
    chomp();
    my $txt_record_ref = doc2txt($_);
    print $$txt_record_ref;
}

It is very similar to Josh's howto-pdf-prog.pl example.

Does your point regarding "How will swish know where one doc ends and the next one starts?" still hold for this solution?

It appears to work, and I can follow the logic, but I'm struggling to understand the documentation for SWISH::Filter.

The only thing I'm looking for now is xls2txt and ppt2txt equivalents of doc2txt.

>
> I suspect if you ran these from the command line you would see the
> difference. The -S prog method needs headers to know what the file
> name is and how long it is.
>
> Use SWISH::Filter.
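
To make the point about headers concrete: as far as I understand the -S prog interface, each document is preceded by a Path-Name: header and a Content-Length: header (a byte count), then a blank line, then the content itself. Below is a minimal sketch of that framing. It is not taken from the swish-e distribution; swish_record and the example path are hypothetical names, and the literal string stands in for whatever a converter such as doc2txt returns.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build one record for swish-e's -S prog input: headers, a blank line,
# then the document text. (swish_record is a hypothetical helper name.)
sub swish_record {
    my ($path, $content) = @_;

    # Content-Length is a byte count, not a character count.
    my $bytes = do { use bytes; length $content };

    return "Path-Name: $path\n"
         . "Content-Length: $bytes\n"
         . "\n"
         . $content;
}

# Demo: a literal string stands in for converter output.
print swish_record('./doc/example.doc', "Cricket Roundup\n");
```

In a loop over files, printing swish_record($_, $$txt_record_ref) instead of the bare text should give swish a file name and a length for every document, which is what the "required headers Path-Name:" error was complaining about.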
>
>
>
> --
> Bill Moseley
> moseley@hank.org
>
Received on Fri Dec 2 06:06:42 2005