Re: not ignoring content (leave those files alone!)

From: Linda W. (that's swishey, not squishey!) <swishey(at)>
Date: Sun Jun 11 2006 - 20:55:27 GMT
Bill Moseley wrote:
> Now, most people spider their sites, and the spider can look at the
> Content-Type header to determine what to index.
	I'm trying to index local file systems, not a website.

> What gets filtered depends on what you might have installed.  IIRC,
> xpdf and catdoc are included in the windows build, where building from
> source you have to install those separately.  So, if you use the
> spider you will likely not have all these problems.
	spider seems designed for a website, not a local file system.

> The program that's included with the distribution makes
> use of SWISH::Filter.  It simply scans the file system (like the
> default mode of swish), but it will filter based on mime type just
> like spidering.  So, that may be much easier if you want to scan the
> file system instead of spider a web site.
	The problem, as I think you mentioned, is that "NoContents" will
still look through binary files to find a title (or content type).  It
seems to take a long time on large binary files.  In the one directory
I have scanned so far, it took 5 minutes just to plow through one 2.2M file.
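
	For illustration only, here is a minimal Python sketch of the idea being
discussed: walk a local file system and decide from the file name alone
whether a file's contents should be read, so that large binaries are
never opened.  This is not how swish-e or SWISH::Filter is implemented;
the names classify() and scan() are made up for the example, and guessing
the type from the extension with Python's mimetypes module is an
assumption, not something the indexer is known to do.

```python
import mimetypes
import os


def classify(path):
    """Return 'index' for text-like files, 'meta-only' for everything else.

    The decision uses only the file name (via mimetypes.guess_type),
    so the file itself is never opened here.
    """
    mime, _encoding = mimetypes.guess_type(path)
    if mime is not None and mime.startswith("text/"):
        return "index"
    # Unknown or non-text types: record the path, skip the contents.
    return "meta-only"


def scan(root):
    """Walk root and yield (path, decision, size_in_bytes) for each file."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            yield path, classify(path), os.path.getsize(path)
```

	Calling scan() on a directory yields one tuple per file; anything
marked 'meta-only' would be indexed by path name only, which is the
behaviour the "FileMetaOnly" name suggested below would imply.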

	Maybe NoContents would be better named "FileMetaOnly" -- that
makes it clear that the file may be scanned by the default scanner for HTML

> perldoc for some details, but it's not a very complex
> script.
	It's my first shot on some of this, and wanted to try simpler (though
less efficient) methods first (verify and get comfortable with the basic

> If you want the details of SWISH::Filter see:
> The INSTALL doc has examples of indexing, and one is spidering.
> Might save yourself a lot of time if you follow those instructions.
	Looked through all of them before writing my first conf file.
At some point the nomenclature for conf files reads "config".  I found
this a bit confusing.  With the main config file being referred to as
swish.conf, I straightaway looked for *.conf.  Didn't pick up the examples
in the config dir until later examination...

> My only comment is *I* probably would not use the swish.cgi script.
> It's a bit bloated with features.  I think it's easier to just write
> a simple search script -- maybe use the search.cgi script for ideas.
	"Teaching" scripts don't have to be the most efficient, though
efficient examples of "best practices" are certainly a great aid.

	Seems like it has been a while since the last release.  Is that pace
expected to continue in the near future, or do you think there will be more
frequent releases coming up?

