indexing Msoft Word docs

From: David Larkin <david.larkin(at)not-real.djl.co.uk>
Date: Fri Dec 02 2005 - 11:30:01 GMT
I started using swish-e yesterday, and first impressions are very favourable.

Following Josh Rabinowitz's 'How to Index Anything', I was able to index HTML and PDF files and then configure swish.cgi to get a web search form.

I'd now like to do the same for Word docs.

I used Josh's howto-doc-prog.pl as a starting point:

#!/usr/bin/perl -w
use pdf2xml;
my @files =
    `find ./pdf/ -name '*.pdf' -print`;
for (@files) {
    chomp();
    my $xml_record_ref = pdf2xml($_);
    # this is one XML file with a SWISH-E header
    print $$xml_record_ref;
}

I've tried to build an equivalent for Word docs, and came up with:

#!/usr/bin/perl -w

my @files =
    `find ./doc/ -name '*.doc' -print`;
for (@files) {
    chomp();
    my $xml_record_ref = exec "/usr/local/bin/catdoc $_";
    # this is one XML file with a SWISH-E header
    print $$xml_record_ref;
}

which, when I run it, gives:

100:sparrow.djl.co.uk{david}% swish-e -c howto-doc.conf -S prog

Warning: UseStemming is deprecated.  See FuzzyIndexingMode in the docs
Indexing Data Source: "External-Program"
Indexing "./howto-doc-prog.pl"
External Program found: ./howto-doc-prog.pl

Warning: Unknown header line: 'Cricket Roundup' from program ./howto-doc-prog.pl
err: External program failed to return required headers Path-Name:
.
101:sparrow.djl.co.uk{david}%


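I guess the problem is that catdoc produces plain text and not XML, so swish-e never sees the headers it expects ('Cricket Roundup' is the first line of one of my documents).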
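
As far as I understand the -S prog interface, swish-e wants each document preceded by a few header lines (Path-Name and Content-Length are required; Last-Mtime and Document-Type are optional), then a blank line, then exactly Content-Length bytes of content. Note also that exec replaces the Perl process and never returns on success, so in my script catdoc's raw output goes straight to swish-e. Below is a rough, untested-against-swish-e sketch of what the program might look like instead. The helper name format_record is my own invention, the catdoc path is the one from my script, and I am assuming that TXT* is the right Document-Type for catdoc's plain-text output.

```perl
#!/usr/bin/perl -w
use strict;
use File::Find;

# Assumed location of catdoc, copied from the script above.
my $catdoc = '/usr/local/bin/catdoc';

# Build one -S prog record: headers, a blank line, then the content.
# Content-Length must be the exact byte count of the content.
# (format_record is a hypothetical helper, not part of swish-e.)
sub format_record {
    my ($path, $content, $mtime) = @_;
    my $record = "Path-Name: $path\n";
    $record .= "Content-Length: " . length($content) . "\n";
    $record .= "Last-Mtime: $mtime\n" if defined $mtime;
    $record .= "Document-Type: TXT*\n";   # assumption: index as plain text
    return $record . "\n" . $content;
}

binmode STDOUT;   # write bytes unchanged so Content-Length stays accurate

my $dir = './doc';
if (-d $dir) {
    find({ no_chdir => 1, wanted => sub {
        return unless -f $_ && /\.doc$/;
        my $path = $_;
        # List-form pipe open runs catdoc without a shell,
        # so file names with spaces or quotes are passed safely.
        open(my $fh, '-|', $catdoc, $path)
            or do { warn "cannot run catdoc on $path: $!\n"; return; };
        binmode $fh;
        local $/;                 # slurp the whole output at once
        my $text = <$fh>;
        close $fh;
        return unless defined $text && length $text;
        print format_record($path, $text, (stat $path)[9]);
    }}, $dir);
}
```

If this is right, it should run with the same command as before (swish-e -c howto-doc.conf -S prog), since IndexDir already points at the program.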

Do I need to modify howto-doc.conf?

I currently have:

# howto-doc.conf

IndexDir ./howto-doc-prog.pl

IndexFile ./howto-doc.index

UseStemming	yes
MetaNames	swishtitle	swishdocpath

Any ideas?

Thanks
Received on Fri Dec 2 03:30:02 2005