Re: spider a database

From: Michael Porcaro <music(at)>
Date: Sat Nov 05 2005 - 04:07:43 GMT
When I use this command to spider my site,

Swish-e -S http -I

It takes a while to spider.  At that rate I think I would have to wait
about a month for it to finish everything.  It does seem to write a
neater temp file, but there seems to be no way to configure it (for
example, I can't seem to use a swish.conf file).

Yet, when I use this command

Swish-e -S -c swish.conf

Where swish.conf equals:

    IndexOnly .html
    SwishProgParameters default
    Metanames swishtitle swishdocpath
    StoreDescription TXT* 10000
    StoreDescription HTML* <body> 10000
    FuzzyIndexingMode Stemming_en

I can configure it, but it seems to write garbage to the temp files,
and the temp files seem to blow up.  It also seems to take a while to

Now you mentioned that swish-e -S http -I is
deprecated, but that it is still better to use than the second method.
I am not quite sure I follow.  What is the usual way to spider a site?
I'm confused about which method to use.  By the way, I was mistaken when
I said I wanted to spider a database.  Both of the methods I mention
seem to spider my whole site.

How long does it typically take to spider a site that has about 90,000

-----Original Message-----
[] On Behalf Of Bill Moseley
Sent: Friday, November 04, 2005 3:28 PM
To: Multiple recipients of list
Subject: [SWISH-E] Re: spider a database

On Fri, Nov 04, 2005 at 03:16:27PM -0500, Michael Porcaro wrote:
> Please bear with me here and thank you for your patience.  I looked at
> your link and searched around.  From searching, I assume that swish-e
> can spider databases; I wasn't really sure about this before.  I came
> across this document.  Is this the right thing to read, in order to
> figure out how to spider my dynamic pages?

Sorry, I was confused as I thought you wanted to index docs in a
database without using http.  Which is it?

If you want to index stuff in a database then search for the
file in the swish-e distribution.

> Also, I am confused as to where I should direct the config file to
> spider the dynamic links.  Let's say I want to spider this particular
> file:

How does the spider, or anyone for that matter, know if that's a static
file or a dynamically generated file?

> Piano-Music-f50.html is actually a php generated file with an html
> alias, but I don't know where to direct swish-e to spider this file.

I have no idea what an html alias is in that context, but you point
the spider to the same place you would point anyone else: to its URL.

> When I spider the files under /home/yc/www/forum (my local site for
>, all it does is spider the files that run the
> forum, not the actual content dynamic pages, such as
> "Piano-Music-f50.html" or equivalently

The term "spider" implies you are spidering your web site, most likely
with the oddly named program "".  That would be spidering
like Google does -- by accessing your documents via the web.
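
To make "accessing your documents via the web" concrete, here is a
minimal sketch of what a spider does. It is not the swish-e spider, only
an illustration in Python. The host example.test, the SITE dictionary and
the crawl() and LinkParser names are all hypothetical; a real spider
would fetch each page over HTTP instead of looking it up in a dictionary.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl of a single host.

    `fetch(url)` is supplied by the caller and returns the page's HTML
    as a string, or None if the page cannot be retrieved.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            link = urljoin(url, href).split("#")[0]
            # Stay on the starting host and skip URLs already queued.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Hypothetical site. The crawler only ever sees URLs and HTML; it has no
# way to tell whether a page came from a file on disk or from PHP.
SITE = {
    "http://example.test/forum/":
        '<a href="Piano-Music-f50.html">Piano Music</a>',
    "http://example.test/forum/Piano-Music-f50.html":
        '<a href="./">Back</a> <a href="http://other.test/">Elsewhere</a>',
}

pages = crawl("http://example.test/forum/", SITE.get)
print(sorted(pages))
```

The spider starts from one URL and discovers everything else by
following links, which is why neither the directory on disk nor the
MySQL data directory is something to point it at.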

Please go back and look at the docs again.

> So I guess my basic question would be, what is the address of my
> files?  A very poor guess is, my database files are located here:
> /var/lib/mysql/
> But is this the address to spider?  Or do I spider /home/yc/www/forum
> instead?  

Maybe better if someone else answers that one.

Bill Moseley

Received on Fri Nov 4 20:07:54 2005