
Re: not ignoring content (leave those files alone!)

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Sun Jun 11 2006 - 14:31:33 GMT
On Sat, Jun 10, 2006 at 08:07:12PM -0700, Linda W. (that's swishey, not squishey!) wrote:
> I thought NoContents meant, don't look at the contents of files with
> these extensions, but do index the filenames.  Guess not.

From the docs:

    If the file's type is HTML or HTML2 (as set by IndexContents or
    DefaultContents) then the file will be parsed for a HTML title and
    that title will be indexed.

Swish has to look at the contents if it's going to find the title.
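To make the interplay concrete, a config along these lines (the directive
names are from the Swish-e docs; the particular extensions are just an
illustration) shows how the three directives fit together:

    # Parse .htm/.html as HTML so a title can be extracted
    IndexContents HTML* .htm .html

    # Don't parse these -- but the filename (and title, for HTML
    # types) still gets indexed
    NoContents .gif .jpg .png

    # Fallback for anything without a matching IndexContents rule
    DefaultContents TXT*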

Swish-e's original start in life, about ten years ago, was indexing a
few web pages on Unix machines.  So by default it indexes all files
in a directory and assumes they are all HTML.  Ten years ago that was
probably a reasonable approach.

Now, most people spider their sites, and the spider can look at the
Content-Type header to determine what to index.  The spider that comes
with swish uses a set of Perl modules called SWISH::Filter that can
take a file, try to figure out its MIME type (if not already known),
and then determine whether it can be filtered to text for indexing.
The individual filters for SWISH::Filter are separate Perl modules
(e.g. SWISH::Filter::Pdf2HTML.pm) that sometimes use external
programs (e.g. pdftotext and pdfinfo) to convert the file into
text/html.
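If memory serves, using SWISH::Filter directly from your own script
looks roughly like this (method names are from its docs; treat it as a
sketch, not gospel):

    use SWISH::Filter;

    my $filter = SWISH::Filter->new;

    # Ask the filter chain to convert the file; the MIME type is
    # guessed from the name when not supplied explicitly.
    my $doc = $filter->convert( document => 'report.pdf' );

    if ( $doc && $doc->was_filtered ) {
        my $content = $doc->fetch_doc;  # ref to the filtered text/html
        print $$content;
    }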

What gets filtered depends on what you have installed.  IIRC, xpdf
and catdoc are included in the Windows build, whereas when building
from source you have to install those separately.  So, if you use the
spider you will likely not have any of these problems.


The DirTree.pl program that's included with the distribution makes
use of SWISH::Filter.  It simply scans the file system (like the
default mode of swish), but it will filter based on MIME type just
like spidering.  So it may be much easier if you want to scan the
file system instead of spidering a web site.

Run perldoc DirTree.pl for some details, but it's not a very complex
script.
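Since DirTree.pl is a "prog"-mode input program, running it looks
something like the following (flag spelling from memory -- check the
perldoc; the config file name is just an example):

    # Let DirTree.pl feed filtered documents to the indexer
    swish-e -S prog -c swish.conf -i ./DirTree.pl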

If you want the details of SWISH::Filter see:

    http://swish-e.org/docs/filter.html

The INSTALL doc has examples of indexing, and one of them is spidering.
You might save yourself a lot of time by following those instructions.

http://swish-e.org/docs/install.html#spidering_and_searching_with_a_web_form_

My only comment is that *I* probably would not use the swish.cgi
script.  It's a bit bloated with features.  I think it's easier to
write a simple search script -- maybe use the search.cgi script for
ideas.
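The core of such a script is small.  From memory of the SWISH::API
docs, a minimal search loop looks about like this (the index file name
and query are placeholders):

    use SWISH::API;

    my $swish   = SWISH::API->new( 'index.swish-e' );
    my $results = $swish->Query( 'foo OR bar' );

    # Print the path of each matching document
    while ( my $result = $results->NextResult ) {
        print $result->Property( 'swishdocpath' ), "\n";
    }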



-- 
Bill Moseley
moseley@hank.org

Unsubscribe from or help with the swish-e list: 
   http://swish-e.org/Discussion/

Help with Swish-e:
   http://swish-e.org/current/docs
   swish-e@sunsite.berkeley.edu
Received on Sun Jun 11 07:31:35 2006