Ullas wrote on 6/9/09 11:58 PM:
> Thanks for the reply ...
>
> the output immediately before the error is:
>
> //////////////////////////////////////////////////////////////////////////////////////////
>
> http://www.admiralmotorinn.com.au/index.php?pageid=4080 - Using HTML2 parser - (262 words)
> http://www.admiralmotorinn.com.au/index.php?pageid=3953 - Using HTML2 parser - (177 words)
> http://www.admiralmotorinn.com.au/index.php?pageid=4115 - Using HTML2 parser - (185 words)
>
> Summary for: http://www.admiralmotorinn.com.au/
>   Connection: Close: 1 (0.5/sec)
>   Connection: Keep-Alive: 12 (6.0/sec)
>   Duplicates: 149 (74.5/sec)
>   Off-site links: 74 (37.0/sec)
>   Total Bytes: 92,056 (46028.0/sec)
>   Total Docs: 13 (6.5/sec)
>   Unique URLs: 13 (6.5/sec)
> http://www.admiralmotorinn.com.au/index.php?pageid=3746 - Using HTML2 parser - (193 words)
>
> Warning: External program returned zero Content-Length when processing file'http://www.admiralmotorinn.com.au/index.php?pageid=3746'
> http://www.admiralmotorinn.com.au/index.php?pageid=3746 - Using DEFAULT (HTML2) parser - (no words indexed)
> err: External program failed to return required headers Path-Name:
> .
>
> //////////////////////////////////////////////////////////////////////////////////////////
>

Some of the URLs pulled out by spider.pl seem to have escaped
characters at the end, likely because you have an href like:

  href="http://something  "

so the extra whitespace gets URL-escaped. Or perhaps the links are
encoded that way already.

In any case, there are multiple links to the problem URL with values like:

  http://www.admiralmotorinn.com.au/index.php?pageid=3746%0A%20%20

and those extra characters at the end are having 2 effects: (1) the same
page is being fetched multiple times, and (2) the extra space throws off
the length() check by one byte.

Not sure if it's a spider.pl bug or not, but adding this in your
test_url() sub ref fixed your particular problem:

  return 0 if $uri =~ m/\%20/;

-- 
Peter Karman  .
http://peknet.com/  .  peter(at)not-real.peknet.com
_______________________________________________
Users mailing list
Users@lists.swish-e.org
http://lists.swish-e.org/listinfo/users

Received on Sun Jun 14 22:07:47 2009
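
For anyone landing on this thread later, here is a minimal sketch of a test_url() callback along the lines suggested in the reply. It is an illustration, not the code from the original poster's config. The helper name has_trailing_escaped_space is hypothetical, and the check is a narrower variation on the one-liner in the reply: it skips only URLs that end in escaped whitespace (%20, %0A, %0D, %09), where the original fix skips any URL containing %20. In spider.pl the first argument is a URI object; the sketch stringifies it, so it runs with plain strings as well.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of spider.pl): true if the URL ends in
# URL-escaped whitespace such as %20, %0A, %0D or %09.
sub has_trailing_escaped_space {
    my ($url) = @_;
    return "$url" =~ m/(?:%20|%0A|%0D|%09)+\z/i ? 1 : 0;
}

# Shaped like a spider.pl test_url callback: return false to skip the URL.
# "$uri" stringifies a URI object, so a plain string works here too.
sub test_url {
    my ($uri, $server) = @_;
    return 0 if has_trailing_escaped_space($uri);
    return 1;
}

# Try it on the clean URL and on the escaped variant seen in the links.
for my $url (
    'http://www.admiralmotorinn.com.au/index.php?pageid=3746',
    'http://www.admiralmotorinn.com.au/index.php?pageid=3746%0A%20%20',
) {
    printf "%-70s %s\n", $url, test_url($url) ? 'fetch' : 'skip';
}
```

In a spider config this would presumably be wired in through the server's settings, for example test_url => \&test_url. Skipping the escaped variants means the page is indexed only through links that point at the clean URL, which is the case for the site in this thread.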