Oh, whoops... found my error. The max_indexed/max_files parameters
kept the spider from reaching the .htm files. How embarrassing!
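
For anyone who hits the same thing: with max_indexed set to 20, the quota was presumably used up before the spider reached the .htm links. Below is a rough sketch of the same spider.conf with the limits raised. The value 5000 is only an illustrative assumption, not a recommendation. As far as I understand the spider.pl documentation, leaving out both keys removes the limits altogether, but check that against your version.

```perl
# spider.conf (sketch only; the two limit values are hypothetical)
@servers = (
    {
        base_url => 'http://www.smithtowngospeltabernacle.org/index23.phtml',
        email    => 'jalmberg@identry.com',

        # follow only .phtml, .shtml, .html and .htm links
        test_url => sub { $_[0]->path =~ /\.(phtml|shtml|html|htm)$/ },

        delay_min   => .0001,  # delay in minutes between requests
        max_time    => 10,     # max time to spider, in minutes
        max_files   => 5000,   # hypothetical: raised from 100
        max_indexed => 5000,   # hypothetical: raised from 20
        keep_alive  => 1,      # enable keep-alive requests
    },
);
```

max_time still caps the run at 10 minutes, so a large site may need that raised as well.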
-- John
John Almberg wrote:
>Am I right in thinking that a spider only finds files that are linked to the base document?
>
>I'm using the two configuration files below to spider a site. The site is a mixture of .phtml (php) files and .htm files. However, I can't seem to get the spider to crawl the .htm files, which are definitely linked to the base document. Actually, *directly* from the base document.
>
>Any ideas?
>
>-- John
>
>#swish.conf
>IndexDir ./spider.pl
>SwishProgParameters spider.conf
>DefaultContents HTML
>IndexContents HTML .htm .html .phtml
>
>------------------
>
>#spider.conf:
>@servers = (
> {
> base_url => 'http://www.smithtowngospeltabernacle.org/index23.phtml',
> email => 'jalmberg@identry.com',
>
> # limit to .phtml, .shtml, .html and .htm files
> test_url => sub { $_[0]->path =~ /\.(phtml|shtml|html|htm)$/ },
>
> delay_min => .0001, # Delay in minutes between requests
> max_time => 10, # Max time to spider in minutes
> max_files => 100, # Max Unique URLs to spider
> max_indexed => 20, # Max number of files to send to swish for indexing
> keep_alive => 1, # enable keep-alive requests
> },
>);
>
--
~~~~~~~~~~~~~~~~~~~~~~~~~~
Identry, LLC
www.identry.com
jalmberg@identry.com
Received on Thu Jan 23 21:54:18 2003