On Fri, 2 Dec 2005 03:29:47 -0800 (PST)
David Larkin <david.larkin@djl.co.uk> wrote:
I found doc2txt.pm, and this works:
#!/usr/bin/perl -w
use doc2txt;
my @files = `find ./doc/ -name '*.doc' -print`;
for (@files) {
    chomp();
    # my $xml_record_ref = exec "/usr/local/bin/catdoc $_";
    my $xml_record_ref = doc2txt($_);
    # this is one XML file with a SWISH-E header
    print $$xml_record_ref;
}
;-)
I guess I should really rename the $xml_record_ref variable, but I can now search using swish.cgi, which is great.
> I started using swish-e yesterday, and first impressions are very favourable.
>
> Following Josh Rabinowitz's 'How to Index Anything', I was able to index HTML and PDF files and then configure swish.cgi to get a web search form.
>
> I'd now like to do the same for Word docs.
>
> Using Josh's howto-doc-prog.pl as a starting point
I meant to say howto-doc-prog.pl
>
> #!/usr/bin/perl -w
> use pdf2xml;
> my @files = `find ./pdf/ -name '*.pdf' -print`;
> for (@files) {
>     chomp();
>     my $xml_record_ref = pdf2xml($_);
>     # this is one XML file with a SWISH-E header
>     print $$xml_record_ref;
> }
>
> I've tried to build an equivalent for Word docs. I came up with
>
> #!/usr/bin/perl -w
>
> my @files = `find ./doc/ -name '*.doc' -print`;
> for (@files) {
>     chomp();
>     my $xml_record_ref = exec "/usr/local/bin/catdoc $_";
>     # this is one XML file with a SWISH-E header
>     print $$xml_record_ref;
> }
>
> which, when I run it, gives
>
> 100:sparrow.djl.co.uk{david}% swish-e -c howto-doc.conf -S prog
>
> Warning: UseStemming is deprecated. See FuzzyIndexingMode in the docs
> Indexing Data Source: "External-Program"
> Indexing "./howto-doc-prog.pl"
> External Program found: ./howto-doc-prog.pl
>
> Warning: Unknown header line: 'Cricket Roundup' from program ./howto-doc-prog.pl
> err: External program failed to return required headers Path-Name:
> .
> 101:sparrow.djl.co.uk{david}%
>
>
> I guess the problem is that catdoc produces text and not xml.
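A note on why the catdoc version fails, as far as I can tell: exec replaces the running Perl script with catdoc, so nothing after that line runs and catdoc's raw text goes straight to swish-e, which then reads the first line ('Cricket Roundup') as a header. With -S prog, swish-e expects each document to start with a Path-Name: header and a Content-Length: header, then a blank line, then the content. Below is a minimal sketch of that, assuming catdoc lives at /usr/local/bin/catdoc as in the script above. swish_record is a helper name I made up, and the files are taken from the command line here rather than from find.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build one record for swish-e's -S prog input: header lines,
# a blank line, then exactly Content-Length bytes of content.
# (swish_record is a hypothetical helper, not part of swish-e.)
sub swish_record {
    my ($path, $content) = @_;
    return "Path-Name: $path\n"
         . "Content-Length: " . length($content) . "\n"
         . "\n"
         . $content;
}

# Files to convert are passed as arguments in this sketch.
for my $path (@ARGV) {
    # Read catdoc's output through a pipe. Unlike exec, this returns
    # control to the script, and the list form avoids shell quoting issues.
    open(my $fh, '-|', '/usr/local/bin/catdoc', $path) or next;
    my $text = do { local $/; <$fh> };    # slurp everything catdoc prints
    close($fh) or next;                   # skip files catdoc failed on
    $text = '' unless defined $text;
    print swish_record($path, $text);
}
```

Running it by hand with one or two .doc paths as arguments should show whether the headers look right before pointing IndexDir at it. The conf may also need something like DefaultContents TXT* so the output is parsed as plain text; I have not checked that against a real index.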
>
> Do I need to modify howto-doc.conf?
>
> I currently have
>
> # howto-doc.conf
>
> IndexDir ./howto-doc-prog.pl
>
> IndexFile ./howto-doc.index
>
> UseStemming yes
> MetaNames swishtitle swishdocpath
>
> Any ideas?
>
> Thanks
Received on Fri Dec 2 04:27:47 2005