RE: SWISH-e Problem with doc2txt

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Feb 11 2003 - 15:20:22 GMT
On Tue, 11 Feb 2003, Michael REMY wrote:

That doc2txt.pm module is supposed to be called from another script --
such as from the spider.pl program.  For example, spider.pl fetches a
document, sees that it's an MS Word doc (from the content type) and then
uses the doc2txt.pm *module* to convert it to text and returns the
document.

And doc2txt.pm can be used two ways:  One way is to pass in a
reference to a doc (the MS Word doc is already in memory, as is the case
for spider.pl), and it simply returns the converted doc as a reference to a
scalar.  To avoid a double-ended pipe, doc2txt.pm writes the file in memory
out to a temp file instead of piping directly to catdoc.  (That's probably
not necessary.)
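
The temp file approach can be sketched roughly like this.  This is not the
code in doc2txt.pm, only an illustration: convert_in_memory() is a made-up
name, and the converter command is passed in so the sketch runs without
catdoc installed (perl itself stands in for it here).

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper, not the code in doc2txt.pm.
# Takes a reference to the document in memory plus a converter command,
# and returns a reference to the converted text.
sub convert_in_memory {
    my ( $doc_ref, @converter ) = @_;

    # Write the in-memory document to a temp file so the converter only
    # has to read a file; we never write to and read from it at once.
    my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );
    binmode $tmp_fh;
    print {$tmp_fh} $$doc_ref;
    close $tmp_fh or die "Cannot write $tmp_name: $!\n";

    # List form of open(): no shell, so the file name needs no quoting.
    open( my $out, '-|', @converter, $tmp_name )
        or die "Cannot run $converter[0]: $!\n";
    my $text = do { local $/; <$out> };    # slurp all output
    close $out;

    return \$text;
}

# Stand-in converter: perl upper-casing its input.
# With catdoc installed it might be: convert_in_memory( \$doc, 'catdoc', '-a' )
my $doc      = "stand-in for Word data\n";
my $text_ref = convert_in_memory( \$doc, $^X, '-pe', '$_ = uc' );
print $$text_ref;
```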

The other way to use doc2txt.pm is to pass in a file name.  For
example, if you are scanning a local directory of files, when you see a
.doc file you can pass the file name to doc2txt.pm and it will return not
only the converted file, but also the headers required for use with swish-e's
-S prog.  Indeed, doc2txt.pm is designed only for use with -S prog type of
input, not as a standalone program to convert MS Word files.
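
For reference, here is a rough sketch of what a -S prog program hands to
swish-e for each document: header lines, a blank line, then the content.
prog_record() is a made-up name and this is not how doc2txt.pm builds its
output; it only shows the Path-Name, Content-Length and Last-Mtime headers
described in the swish-e documentation.

```perl
use strict;
use warnings;

# Hypothetical helper: build one document record for swish-e's -S prog input.
sub prog_record {
    my ( $path, $content, $mtime ) = @_;

    # Content-Length is a byte count, so measure bytes, not characters.
    my $bytes = do { use bytes; length $content };

    my $record = "Path-Name: $path\n" . "Content-Length: $bytes\n";
    $record .= "Last-Mtime: $mtime\n" if defined $mtime;

    # A blank line separates the headers from the document content.
    return $record . "\n" . $content;
}

# Example path and content are placeholders.
print prog_record( 'docs/example.doc', "converted text\n", time );
```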

> to solve my problem with the doc2txt.pm, i had been done to add this lines
> before the doc2txt sub in your script :
> 
> my $file = shift || die "Usage: $0 <filename>\n";
> system("catdoc -a $file > /tmp/toto.txt");
> system("cat /tmp/toto.txt");
> system("unlink /tmp/toto.txt");
> 
> sub doc2txt {........etc.

Well, I don't understand that.  You are using system() when you should be
using backticks.

   my $doc = `catdoc -a $file`;

See perldoc perlfaq8

  Why can't I get the output of a command with system()?
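
A small illustration of the difference, using echo as a stand-in so it runs
without catdoc installed (assumes a Unix-like shell):

```perl
use strict;
use warnings;

# system() runs the command and returns its wait status.  The command's
# output goes straight to our STDOUT; it is not returned.
my $status = system("echo converted text");

# Backticks run the command and return what it printed.
my $output = `echo converted text`;

print "system() returned: $status\n";
print "backticks returned: $output";

# With catdoc the capture would look like this; quoting the file name
# guards against spaces in it:
#   my $doc = `catdoc -a "$file"`;
```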


> swish-e -cind_138.conf -l -v 3 -T

(Note: -T requires a parameter).

You seem to want to convert that module into a program that just converts
MS Word files, or so I assume.  The doc2txt.pm module is used for -S prog
programs and you are not using it as such (no -S prog in your command
above).

If your goal is to convert .doc files to text, then again all you need to
do is use a FileFilter entry:

From the example in the documentation:

  FileFilter .doc  /usr/local/bin/catdoc "-s8859-1 -d8859-1 '%p'"

You don't need a perl program to help with that, and using a perl program
will just slow indexing down.


-- 
Bill Moseley moseley@hank.org
Received on Tue Feb 11 15:24:06 2003