
Max depth error??

From: Juan Salvador Castejón <juans.castejon(at)not-real.gmail.com>
Date: Tue May 24 2005 - 08:31:17 GMT
Hi, 

I'm indexing a web site using spider.pl on a Windows XP machine. The
problem is that swish-e does not index files whose depth is >= 5. The
spider is crawling all the pages correctly and feeding every one of
them to swish-e, but swish-e completely ignores all pages whose depth
is >= 5.

These are my configuration files:

>> spider.conf

my %carm = (
	use_default_config => 1,
	max_depth	=> 10,
	delay_sec	=> 0,
	max_size	=> 0,
	use_cookies	=> 1,
	debug		=> DEBUG_URL | DEBUG_FAILED | DEBUG_SKIPPED | DEBUG_ERRORS |
DEBUG_INFO | DEBUG_LINKS | DEBUG_HEADERS,
	base_url	=> 'http://www.carm.es/ceh/',
	email		=> 'juans.castejon@carm.es',
	link_tags	=> [qw/ a frame area /],
	keep_alive	=> 1,
	test_response	=> sub {
				my $server = $_[1];
				# Do not follow links into binary document formats
				$server->{no_spider} = $_[0]->path =~
/\.(pdf|doc|xls|rtf|ppt)$/i;
				# Index the URL but skip the body for media files,
				# and for anything served as an image; the two tests
				# are combined with || so the second assignment does
				# not overwrite the first (as it did before)
				$server->{no_contents} = $_[0]->path =~
/\.(mp3|avi|wma|jpg|gif|zip|bat|bmp|dot|eps|mdb|png|pps|psd|swf|tiff|wmf|wmv|tif|dwg|exe)$/i
					|| $_[2]->content_type =~ m[^image/];
				return 1;
			       },
	test_url	=> sub {
				# Only follow URLs under www.carm.es/ceh/
				# (dots escaped so they match literally)
				$_[0]->as_string =~ m{^(?:http://)?www\.carm\.es/ceh/};
			       },
);

@servers = (\%carm);
1;

>> swish.conf

IndexDir perl.exe

SwishProgParameters lib\swish-e\spider.pl lib\swish-e\spider.conf

StoreDescription HTML2 <body> 2500

StoreDescription TXT2 2500

StoreDescription HTML <body> 2500

StoreDescription TXT 2500

PropertyNameAlias swishdescription description

DefaultContents HTML2

IndexContents HTML* .htm .html .shtml .xhtml .jsp

IndexContents TXT*  .txt .log .text .pdf .doc .rtf 

IndexContents XML*  .xml

TranslateCharacters :ascii7:

ParserWarnLevel 9 

IgnoreTotalWordCountWhenRanking no


I run swish-e as:

swish-e.exe -S prog -c swish.conf -f index5.swish -v 3 -R 1

Attached to this message are the output logs generated by the indexing
process. In them you can see the spider finding certain pages (all of
them with depth >= 5) that swish-e ignores (e.g. URLs containing
'codmenu=91').
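Since spider.pl can also be run standalone, outside of swish-e, one way to narrow down which side drops those pages is to capture the raw spider output and search it directly. A sketch, assuming the paths from swish.conf above (with forward slashes), run against my own site:

```shell
# Run the spider on its own and keep its output; any page present here
# but absent from the swish-e index was lost on the swish-e side.
perl lib/swish-e/spider.pl lib/swish-e/spider.conf > spider-output.txt 2> spider-debug.txt

# Check whether the deep pages (e.g. codmenu=91) made it into the output
grep -c "codmenu=91" spider-output.txt
```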

Is this a bug, or am I missing something? Any help would be greatly appreciated.
Thank you in advance.

Juan Salvador Castejón,
Received on Tue May 24 01:31:22 2005