Re: search only .html and no extension files

From: Michael Porcaro <music(at)>
Date: Wed Nov 09 2005 - 07:55:46 GMT
Great, after an agonizing week of researching swish-e, I think I almost
have the basics needed to spider a site, but not quite yet.  Let me just
explain what was confusing for me, and most likely other newbies, and
hopefully I am now on the right track.  You don't have to answer my
rhetorical questions or comments, maybe just a "you're on the right track"
or "you're totally off":

1.  Using a config file (-c config) doesn't seem to work well when using
the -S prog command.  I also noticed you can't run a config file and a at the same time.  Is this true?  There really is
no point in doing this anyway.

2.  You need the -S prog command when you choose to use  When
indexing with, you don't need swish.conf, (not sure if you CAN
use it though) you DO need  Use this command:
swish-e -S prog -I and will be called
automatically, as long as that file is in the same directory.

3.  .swishcgi.conf was REALLY confusing me.  This file apparently isn't
used for the spidering process, it seems to be used AFTER the process is
done, to control how many searches per page, the title, etc.  Correct?

4. replaces a config file and is more efficient.
It is written in perl, so it has more complex coding, but more power and
control. is simply the config file for

My question now is regarding the test_url function.  Basically, I am
interested in only spidering html and non extension files.  Here is an
example of a non extension file:

I tried this command in which was in your

test_url    => sub { $_[0]->path =~ /\.html?$/ },

But it doesn't seem to work.  It keeps reporting an error saying no files
were indexed.  When I comment this line out, the spidering does work, so
there seems to be a problem with that line of code.  Any suggestions?
Are there other ways to "index only html" or is test_url the best way to
do this?
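
For reference, here is a minimal sketch of a test that accepts both cases, assuming the callback gets a URI object as its first argument (as the line above implies). The helper name html_or_no_extension and the sample paths are made up for illustration and are not part of swish-e. One possible reason for the "no files were indexed" error, which I have not verified, is that the starting URL itself (a path such as "/") fails the /\.html?$/ test, so nothing is ever fetched; the sketch lets such paths through.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of swish-e): true when the URL path ends in
# .htm/.html, or when its last segment has no extension at all.
sub html_or_no_extension {
    my ($path) = @_;
    return 1 if $path =~ /\.html?$/i;      # page.html, page.htm
    my ($last) = $path =~ m{([^/]*)$};     # text after the final slash
    return $last =~ /\./ ? 0 : 1;          # no dot means no extension
}

# In the spider config it would be wired in the same way as the original line:
#   test_url => sub { html_or_no_extension( $_[0]->path ) },

# Quick check with made-up paths.
for my $path ( '/', '/docs/', '/docs/about', '/index.html', '/logo.gif', '/a.b/page' ) {
    printf "%-14s %s\n", $path, html_or_no_extension($path) ? 'index' : 'skip';
}
```

Only the last path segment is checked for a dot, so a directory name containing a dot does not cause an extensionless page below it to be skipped.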

-----Original Message-----
On Behalf Of Bill Moseley
Sent: Tuesday, November 08, 2005 11:44 PM
To: Multiple recipients of list
Subject: [SWISH-E] Re: search only .html and no extension files

On Tue, Nov 08, 2005 at 12:49:31PM -0800, Michael Porcaro wrote:
> Hi,
> Question 1:  
> Lets say I add a new page.  Do I have to spider the whole site again
> index the 1 page?

Mostly, yes.

> Question 2:
> I finally was able to spider my site, and get the search engine to
> One problem now:
> The spider indexed every single link when I instructed it to index
> by using this config file called swish.conf
> # Use for indexing 
> IndexDir
> IndexOnly .html

IndexOnly isn't used when using -S prog input method (i.e. using

> It took about 7 hours to spider the whole site with this command:
> Swish-e -e -S prog -c swish.conf
> There are a lot of useless links in the index file which is 80 megs.
> How can I filter out every page except .html?  How come it didn't obey
> the config file?

should cover most of that.

Bill Moseley

