Re: VERY CONFUSED ABOUT FILTERS

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Dec 02 2005 - 17:00:14 GMT
On Fri, Dec 02, 2005 at 07:56:33AM -0800, David Larkin wrote:
> Then I read a paragraph which I simply don't understand.
> 
> But, Swish-e will not use SWISH::Filter by default when using the
> file system method of indexing. To use SWISH::Filter when indexing
> by file system method (-S fs), you can use a FileFilter directive
> with the swish_filter.pl filter (which is just a program that uses
> SWISH::Filter) or use the -S prog method of indexing and use the
> DirTree.pl program for fetching documents.

I'll get to that in a second.

Think of swish in small functional units.

Swish basically parses html, xml, or text and creates an index.
How it gets documents varies and that's a separate feature.

Ok, so first there's the default -S fs -- that uses a built-in bit of
code to walk the file system and read files.  That's really all it
knows how to do.  But what do you do when you have non-text/xml/html
docs?

Then "FileFilter" was added as a way for *swish* to pass a document to
a program and read back the program's output.  You needed to define a
filter for each type of file (based on file extension).  That's not
great for a number of reasons (what does file extension have to do
with anything???) and you have to be specific about what programs
filter what.

[I'm leaving out the -S http method because it stinks]


Then swish added the -S prog method which allowed swish to read input
from STDIN if the input was formatted correctly (a header before each
document).  That meant you could do something like:

    some_program | swish-e -S prog -i stdin

All some_program has to do is output text, xml, or html, and a header
for each file saying the file name, file length, last modified, etc.
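
To make that concrete, here is a minimal sketch of what a "some_program" could look like. It is written in Python purely for illustration (the programs shipped with swish-e are Perl), and the function names are made up for this example. It assumes the header names documented for the -S prog method (Path-Name, Content-Length, Last-Mtime, Document-Type); check the docs for your version before relying on it.

```python
import os
import sys


def format_doc(path_name: str, body: bytes, mtime: int, doc_type: str = "TXT*") -> bytes:
    """Build one record: header lines, a blank line, then the document content."""
    header = (
        f"Path-Name: {path_name}\n"
        f"Content-Length: {len(body)}\n"  # length in bytes, not characters
        f"Last-Mtime: {mtime}\n"          # seconds since the epoch
        f"Document-Type: {doc_type}\n"    # which parser swish should use
        "\n"                              # blank line ends the header
    )
    return header.encode("utf-8") + body


def main(paths):
    """Write one record per file to stdout (binary, so lengths stay exact)."""
    out = sys.stdout.buffer
    for path in paths:
        with open(path, "rb") as fh:
            body = fh.read()
        out.write(format_doc(path, body, int(os.path.getmtime(path))))
```

Saved as a script that calls main(sys.argv[1:]), this would be piped into swish-e exactly like some_program above.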

Now it would be nice to have a utility that can take any document,
look at its content type, and then decide how to filter it into one
of the three formats that swish understands.  That's what
SWISH::Filter does.

SWISH::Filter is passed a file (in memory or on disk) and it
determines the file's content type and then looks for a filter for
that file.  It then returns the filtered file.
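
The idea can be sketched as a lookup from content type to converter. This is not SWISH::Filter's actual code (that is Perl); every name below is hypothetical and the "text/csv" converter is a toy, included only to show the dispatch.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical registry: input content type -> (output content type, converter).
FILTERS: Dict[str, Tuple[str, Callable[[bytes], bytes]]] = {}

# Types the indexer can parse directly, so they pass through unchanged.
NATIVE_TYPES = {"text/plain", "text/html", "text/xml"}


def register(in_type: str, out_type: str):
    """Decorator that records a converter for one input content type."""
    def wrap(func):
        FILTERS[in_type] = (out_type, func)
        return func
    return wrap


@register("text/csv", "text/plain")
def csv_to_text(data: bytes) -> bytes:
    # Toy converter: replace commas with spaces so the fields index as words.
    return data.replace(b",", b" ")


def filter_document(data: bytes, content_type: str) -> Optional[Tuple[str, bytes]]:
    """Return (content_type, data) ready for indexing, or None if no filter fits."""
    if content_type in NATIVE_TYPES:
        return content_type, data
    entry = FILTERS.get(content_type)
    if entry is None:
        return None
    out_type, convert = entry
    return out_type, convert(data)
```

The caller never names a filter; it hands over the document and its content type and gets back something indexable, or nothing.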

You can't really use it this way, but it's basically like:

    fetch_files | SWISH::Filter | swish-e -S prog -i stdin

It's really done like this:

    DirTree.pl <some params> | swish-e -S prog -i stdin

or

    spider.pl <some params> | swish-e -S prog -i stdin

Look at DirTree.pl in your distribution and see how that works.


SWISH::Filter automatically loads filters that are installed.
SWISH::Filter also uses helper programs, for example it uses "catdoc"
to read MS Word docs.

So a user on a Debian-based system might do this:

    spider.pl <some params> | swish-e -S prog -i stdin

and realize that MS Word docs are not being indexed.  Then they would
do:

    # apt-get install catdoc

and then the Word docs would magically get indexed because catdoc is
now available on the computer.  It works because there's already a
SWISH::Filters::Doc2txt.pm module that knows how to use catdoc -- if
catdoc is installed.
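
The "works once the helper is installed" behavior comes down to checking for the program before using it. A rough Python sketch of that pattern follows; it is not the real module, the function names are invented, and it assumes catdoc writes the extracted text to stdout when given a file name.

```python
import shutil
import subprocess
from typing import Optional


def helper_available(program: str) -> bool:
    """True if the helper program can be found on PATH."""
    return shutil.which(program) is not None


def word_to_text(path: str, helper: str = "catdoc") -> Optional[str]:
    """Convert a Word doc to text with the helper, or return None if it is missing."""
    if not helper_available(helper):
        return None  # helper not installed: this filter is skipped
    result = subprocess.run(
        [helper, path],
        capture_output=True,
        text=True,
        errors="replace",  # do not fail on bytes that cannot be decoded
        check=True,        # raise if the helper exits with an error
    )
    return result.stdout
```

Nothing in the configuration changes when the helper is installed; the next run simply finds it.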

Or, say someone wants to index OpenOffice docs and there isn't an
existing SWISH::Filters:: module to do the work.  So they create a
SWISH::Filters::OO2html.pm file (by copying an existing filter) and
then magically OO docs will be indexed with no changes to any configs.

The terminology is poor.  SWISH::Filter is a module that loads
SWISH::Filters::* modules.  A SWISH::Filters::* module may do all the
work of filtering, or it may use other modules or programs.  Like
above, "catdoc" is used to read MS Word docs.

There's a wrapper program for SWISH::Filter called swish-filter-test:

    $ swish-filter-test 050819-securing-mac-os-x-tiger.pdf

    Document 050819-securing-mac-os-x-tiger.pdf was  filtered.
       Document:     050819-securing-mac-os-x-tiger.pdf  (050819-securing-mac-os-x-tiger.pdf)
       Content-Type: text/html
       Parser type:  HTML*

       >Filter used: SWISH::Filters::Pdf2HTML=HASH(0x88b7994) ( application/pdf -> text/html )


Now, back to that paragraph.

The FileFilter allows swish to take a document it's processing and
pass it to an external program.  So, there's the "swish_filter.pl"
program that allows you to use SWISH::Filter via a FileFilter
directive.  I don't recommend using it that way, but it's possible.



> Can I have a single index for a directory with different filetypes ?

Sure.


> I guess I have to add lines like 
> 
> FileFilter .pdf  pdftotext   "'%p' -" IndexContents TXT* .pdf
> 
> to the config file

You can still do that, but I think it's better to use SWISH::Filter.


> But then what args do I use with swish-e to create the index ?
> 
> Is there an example, or tutorial anywhere ?

Are the docs that hard to follow?

There's this:

  http://swish-e.org/docs/install.html#general_configuration_and_usage

which includes three steps to index a site.

Then this follows that:

  http://swish-e.org/docs/install.html#spidering_and_searching_with_a_web_form_

which has all the steps for not only indexing but for creating a
search page.  If you have catdoc and xpdf installed that will index
Word and PDF docs.

Right after that is:

  http://swish-e.org/docs/install.html#indexing_other_types_of_documents_filtering

Which you read.  It clearly states:

   This has resulting in a bit of confusion.

I wonder if that was on purpose.

That section, I think, explains what I said above.



-- 
Bill Moseley
moseley@hank.org

Received on Fri Dec 2 09:00:21 2005