Re: Indexing of word documents, stored on a UNIX

From: FISHER,JOSEPH (Non-HP-Roseville,ex1) <joseph_fisher(at)not-real.non.hp.com>
Date: Mon Aug 20 2001 - 17:21:59 GMT

Hi Bill,

First, let me say that SWISH-E is a wonderful tool... My manager is very,
very pleased with its performance and scalability...

Now, to address some of my concerns and issues while installing SWISH-E...

You mentioned filter files, configuration files, etc...

I feel that swish-e should have standard configuration files in place, and
each of those configuration files should be specifically named... (If each
different installer chooses their own naming convention, that causes half of
the confusion... Especially when a different installer, like myself, has to
come in and make heads or tails of what someone else has done previously...)

PLEASE... Standardize the naming convention of configuration files... Don't
let the user / installer create their own naming conventions... It's too
difficult to maintain...

In our case, the person who originally installed swish 1.3 created a
configuration file called user.config.fs...

Until I had already spent several hours digging, I did NOT know that this
was the configuration file I needed to place the catdoc lines in...

Example:

	You have the following "possible" configuration files: configure,
config.h, swish.h, filter.h

I found the .../filter-bin/_doc2text.sh script by doing a "find . -exec grep
-l catdoc {} \;" from the command line...

I found various other configuration related entries, using similar find or
grep commands...
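
A minimal sketch of that kind of search, wrapped in a small function. The name find_mentions is hypothetical and used only for illustration; "-type f" and the "+" terminator are additions to the command quoted above, so that grep only sees regular files and is started once per batch of files rather than once per file:

```shell
# find_mentions is a hypothetical helper (not part of swish-e):
# list the regular files under directory $1 that contain the fixed string $2.
find_mentions() {
    # -type f : skip directories and other non-regular files
    # grep -l : print only the names of matching files
    # grep -F : treat the search term as a plain string, not a regex
    # {} +    : pass many file names to each grep invocation
    find "$1" -type f -exec grep -l -F -- "$2" {} +
}
```

Calling "find_mentions . catdoc" from the top of the installation directory would list every file that mentions catdoc, such as the _doc2text.sh script.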

When I saw the catdoc entry in this script, I was confused as to where the
entries should go...

Then, when I attempted to put the FileFilter entry in the configuration
file, I wasn't sure whether I needed to change anything in the syntax of the
entry or not... You could say something like: "Place the following line in
your configuration file:"

	FileFilter .doc /usr/local/bin/catdoc "-s8859-1 -d8859-1 '%p'"

Or... Better yet... You should probably place actual entries for the filter
files, inside of your "future", "standardized" configuration file...
Commented out, of course, with a note saying that the actual executables
need to be installed before uncommenting the lines...
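
A rough sketch of what such a shipped sample could look like, written out from the shell. The file name swish.conf.sample is hypothetical, the directives are the ones quoted from Bill's message below, and it assumes that lines starting with "#" are treated as comments in the configuration file:

```shell
# Hypothetical file name, chosen only for this illustration.
sample_conf="swish.conf.sample"

# Write a sample configuration with the filter entry commented out.
cat > "$sample_conf" <<'EOF'
# Sample swish-e configuration (illustrative only).
IndexOnly .html .htm .doc .txt
IndexContents HTML .html .htm
IndexContents TXT .doc .txt

# MS Word documents need the external catdoc program.
# Install catdoc first, then remove the leading "# " from the next line.
# FileFilter .doc /usr/local/bin/catdoc "-s8859-1 -d8859-1 '%p'"
EOF
```

An installer would then only need to install catdoc and uncomment one line, instead of hunting for where the entry belongs.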

Thanks in advance, and have a great week...

Joe Fisher

-----Original Message-----
From: Bill Moseley [mailto:moseley@hank.org]
Sent: Friday, August 17, 2001 15:37
To: Multiple recipients of list
Subject: [SWISH-E] Re: Indexing of word documents, stored on a UNIX


At 02:55 PM 08/17/01 -0700, FISHER,JOSEPH (Non-HP-Roseville,ex1) wrote:
>Hi Bill,
>
>Ok, I understand that I need to include a filter file in order to index the
>contents of MS Word documents stored on a Unix system... (As I understand
>it, this was NOT necessary under SWISH 1.3...)

That's always been the case.  Swish-e has never natively parsed word docs.
Rainer added the filter feature to allow indexing other document types.


>I've downloaded and compiled "catdoc"... Catdoc is even referenced in one of
>the filter files under SWISH-E 2.1...
>
>	.../filter-bin/_doc2text.sh

Again, I would not advise using a shell script for performance reasons.


>I've installed it in its default location, and made sure that the filter
>file is pointing to the correct directory structure...
>
>But which configuration file should I modify to make SWISH-E see this MS
>Word filter file?

What config files do you have?

The example in the reference SWISH-CONFIG I posted shows:

  FileFilter .doc /usr/local/bin/catdoc "-s8859-1 -d8859-1 '%p'"

That would go in your swish configuration file.  

So you might have a swish.conf containing:

  IndexOnly .html .htm .doc .txt
  IndexContents HTML .html .htm
  IndexContents TXT .doc .txt
  FileFilter .doc /usr/local/bin/catdoc "-s8859-1 -d8859-1 '%p'"

then run

  ./swish-e -c swish.conf -i /home/docs
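
Before running the index, it may be worth confirming that the program named in the FileFilter line is actually installed. A minimal sketch, where filter_ready is a hypothetical helper and the path is simply the one from the example above:

```shell
# Path copied from the FileFilter line in the example configuration.
filter_prog="/usr/local/bin/catdoc"

# filter_ready is a hypothetical helper (not part of swish-e):
# succeeds only if $1 is an existing file that is executable.
filter_ready() {
    [ -f "$1" ] && [ -x "$1" ]
}

if filter_ready "$filter_prog"; then
    echo "filter found: $filter_prog"
else
    echo "filter missing or not executable: $filter_prog" >&2
fi
```

If the check reports the filter as missing, fix the path in the configuration file (or install catdoc) before running swish-e.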

If the documentation is unclear please say so, and what you think needs to
be changed or is confusing.





Bill Moseley
mailto:moseley@hank.org
Received on Mon Aug 20 17:22:32 2001