Re: spider a database

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Sat Nov 05 2005 - 04:09:56 GMT

the documentation is not clear on this point, but you should not use -S http.

Instead, use the spider.pl script with the -S prog option; it gives much
better performance and control. See the spider.pl perldoc.
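
As a rough sketch only (the directives and URL are copied from the config
quoted below; the file name swish.conf and the assumption that spider.pl
can be found from the current directory are illustrative, not verified
against your install), the -S prog setup looks something like this:

    # swish.conf -- minimal example, adjust to your site
    # With -S prog, IndexDir names the program to run, not a directory
    IndexDir spider.pl
    # Arguments handed to spider.pl: use its default settings and this start URL
    SwishProgParameters default http://www.youngcomposers.com

and is then run with:

    swish-e -S prog -c swish.conf

Note the word "prog" after -S; without it swish-e does not know which
input method the config file is meant for.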

Michael Porcaro scribbled on 11/4/05 10:05 PM:

> When I use this command to spider my site,
> 
> swish-e -S http -I http://www.youngcomposers.com
> 
> It takes a while to spider.  I think I would have to wait about a month
> for it to finish everything at that rate.  It seems to print a neater
> temp file though, but there seems to be no way to configure this
> (for example, I can't seem to use a swish.conf file).
> 
> Yet, when I use this command
> 
> swish-e -S -c swish.conf
> 
> Where swish.conf equals:
> 
>     IndexDir spider.pl
>     IndexOnly .html
>     SwishProgParameters default http://www.youngcomposers.com
>     Metanames swishtitle swishdocpath
>     StoreDescription TXT* 10000
>     StoreDescription HTML* <body> 10000
>     FuzzyIndexingMode Stemming_en
> 
> I can configure it, but it seems to print out garbage in the temp files,
> and the temp files seem to blow up.  It also seems to take a while to
> index.
> 
> Now you mentioned that swish-e -S http -I http://www.mysite.com is
> deprecated, but it is better to use than the following method.  I am
> not quite sure I follow.  What is the common way to spider a site?  I'm
> confused which method to use.  By the way, I was confused when I said I
> wanted to spider a database.  Both the methods I mention seem to spider
> my whole site.
> 
> How long does it typically take to spider a site that has about 90,000
> pages?
> 
> -----Original Message-----
> From: swish-e@sunsite3.berkeley.edu
> [mailto:swish-e@sunsite3.berkeley.edu] On Behalf Of Bill Moseley
> Sent: Friday, November 04, 2005 3:28 PM
> To: Multiple recipients of list
> Subject: [SWISH-E] Re: spider a database
> 
> On Fri, Nov 04, 2005 at 03:16:27PM -0500, Michael Porcaro wrote:
> 
>>Please bear with me here and thank you for your patience.  I looked at
>>your link and searched around.  By searching, I assume that swish-e can
>>spider databases, I wasn't really sure about this before.  I came across
>>this document.  Is this the right thing to read, in order to figure out
>>how to spider my dynamic pages?
> 
> 
> Sorry, I was confused as I thought you wanted to index docs in a
> database without using http.  Which is it?
> 
> If you want to index stuff in a database then search for the MySQL.pl
> file in the swish-e distribution.
> 
> http://cvs.sourceforge.net/viewcvs.py/swishe/swish-e/prog-bin/MySQL.pl?rev=1.2&view=auto
> 
> 
>>Also, I am confused as to where I should direct the config file to
>>spider the dynamic links.  Let's say I want to spider this particular
>>file:
>>
>>http://www.youngcomposers.com/forum/Piano-Music-f50.html
> 
> 
> How does the spider, or anyone for that matter, know if that's a static
> file or a dynamically generated file?
> 
> 
>>Piano-Music-f50.html is actually a php generated file with an html
>>alias, but I don't know where to direct swish-e to spider this file.
> 
> 
> I have no idea what an html alias is in that context, but you point
> the spider to the same place you would point anyone else.  To its url.
> 
> 
> 
>>When I spider the files under /home/yc/www/forum (my local site for
>>www.youngcomposers.com), all it does is spider the files that run the
>>forum, not the actual content dynamic pages, such as
>>"Piano-Music-f50.html" or equivalently
>>http://www.youngcomposers.com/forum/index.php?showforum=50
> 
> 
> The term "spider" implies you are spidering your web site, most likely
> with the oddly named program "spider.pl".  That would be spidering
> like google does -- by accessing your documents via the web.
> 
> Please go back and look at the docs again.
> 
> http://swish-e.org/docs/install.html#general_configuration_and_usage
> 
> http://swish-e.org/docs/install.html#spidering_and_searching_with_a_web_form_
> 
> http://swish-e.org/docs/spider.html
> 
> 
> 
>>So I guess my basic question would be, what is the address of my dynamic
>>files?  A very poor guess is, my database files are located here:
>>
>>/var/lib/mysql/
>>
>>But is this the address to spider?  Or do I spider /home/yc/www/forum
>>instead?  
> 
> 
> Maybe better if someone else answers that one.
> 

-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Fri Nov 4 20:09:56 2005