Hi,
I'm indexing a web site using spider.pl on a Windows XP machine. The
problem is that swish-e does not index files whose depth is >= 5. The
spider crawls all the pages correctly and passes all of them to
swish-e, but swish-e completely ignores every page whose depth is >=
5.
These are my configuration files:
>> spider.conf
my %carm = (
    use_default_config => 1,
    max_depth => 10,
    delay_sec => 0,
    max_size => 0,
    use_cookies => 1,
    debug => DEBUG_URL | DEBUG_FAILED | DEBUG_SKIPPED | DEBUG_ERRORS |
             DEBUG_INFO | DEBUG_LINKS | DEBUG_HEADERS,
    base_url => 'http://www.carm.es/ceh/',
    email => 'juans.castejon@carm.es',
    link_tags => [qw/ a frame area /],
    keep_alive => 1,
    test_response => sub {
        my $server = $_[1];
        $server->{no_spider} = $_[0]->path =~
            /.*\.(pdf|PDF|doc|DOC|xls|XLS|rtf|RTF|ppt|PPT)$/;
        $server->{no_contents} = $_[0]->path =~
            /.*\.(mp3|avi|wma|jpg|gif|zip|bat|bmp|dot|eps|mdb|png|pps|psd|swf|tiff|wmf|wmv|tif|dwg|exe)$/;
        $server->{no_contents} = $_[2]->content_type =~ m[^image/];
        return 1;
    },
    test_url => sub {
        $_[0]->as_string =~ /^(http:\/\/)?www.carm.es\/ceh\/(.)*/;
    },
);
@servers = (\%carm);
1;
>> swish.conf
IndexDir perl.exe
SwishProgParameters lib\swish-e\spider.pl lib\swish-e\spider.conf
StoreDescription HTML2 <body> 2500
StoreDescription TXT2 2500
StoreDescription HTML <body> 2500
StoreDescription TXT 2500
PropertyNameAlias swishdescription description
DefaultContents HTML2
IndexContents HTML* .htm .html .shtml .xhtml .jsp
IndexContents TXT* .txt .log .text .pdf .doc .rtf
IndexContents XML* .xml
TranslateCharacters :ascii7:
ParserWarnLevel 9
IgnoreTotalWordCountWhenRanking no
I run swish-e as:
swish-e.exe -S prog -c swish.conf -f index5.swish -v 3 -R 1
Attached to this message are the output logs generated by the indexing
process. They show that the spider finds certain pages (all of them
with depth >= 5) that swish-e then ignores (e.g. URLs containing
'codmenu=91').
Is this a bug, or am I missing something? Any help would be greatly appreciated.
Thank you in advance.
Juan Salvador Castejón,
Received on Tue May 24 01:31:22 2005