Antw: [SWISH-E:424] Re: Indexing PDF

From: Rainer Scherg RTC <Rainer.Scherg(at)not-real.rexroth.de>
Date: Tue Aug 11 1998 - 16:03:33 GMT
> On Tue, 11 Aug 1998, Rainer Scherg RTC wrote:
> 
> > But at this moment I've only installed a filter for PDF files (on a
> > Solaris machine). My hope is that, if this feature is released,
> > people will start writing filter programs for different file types
> > (MS-Word, XLS, PPT, and so on...).
> 
>       It would be very nice to have Unix-based filters for Microsoft
>       Office formats, but reverse-engineering those formats would be
>       extremely difficult since you have to handle multiple versions
>       of them, e.g., Word 5, Word 6, Word 97, Word 98, Mac, PC; ditto
>       for Excel and PowerPoint.

>       It's for these reasons I punted in SWISH++ by writing a generic
>       text extraction process.  It's not perfect (nor can be without
>       detailed file-format information) but it seems to do a good job
>       in practice.  See the documentation for details.

Yep!

I'm using a very simple filter program to index WinWord documents on our servers:

Config:
    FileFilter  .doc  simple_txt_extract.sh

----- snip --------
#!/bin/sh
# -- simple_txt_extract.sh <docfile>
# Print the printable character sequences found in the given file.

strings < "$1"
---- snap ---------

You occasionally get some garbage characters, especially when images are
included, but it's sufficient for indexing the doc.
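
If the garbage becomes a problem, one possible refinement is to drop extracted
lines that contain no word-like run of letters. The following is only a sketch
under a few assumptions: the script name filtered_txt_extract.sh and the
function name extract_text are made up for illustration, the threshold of four
letters is arbitrary, and tr is used here in place of strings to split the file
on non-printable bytes.

```shell
#!/bin/sh
# -- filtered_txt_extract.sh <docfile>   (hypothetical name)

# Split the file into runs of printable characters (one run per line),
# then keep only lines containing at least four consecutive letters.
# The threshold of four is an arbitrary assumption; tune it as needed.
extract_text() {
    LC_ALL=C tr -c '[:print:]\n' '\n' < "$1" |
        LC_ALL=C grep -E '[[:alpha:]]{4,}'
}

# Run only when a file argument is given.
if [ $# -ge 1 ]; then
    extract_text "$1"
fi
```

Short fragments such as "x9" are discarded, so some real but very short words
that sit alone between binary bytes will be lost as well. For indexing purposes
that trade-off is probably acceptable.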

Additionally, I got a private reply pointing me to a tool called "catdoc".


BTW: A few words on swish++
  So far I've only read the Readme file (some weeks ago).
  But for my personal taste, swish++ currently lacks some features I want
  to see (e.g. config files).

  But I've a wish list for swish-e, too... ;-)

ciao Rainer
Received on Tue Aug 11 10:42:49 1998