Re: robots.txt

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Oct 31 2005 - 14:57:32 GMT
On Mon, Oct 31, 2005 at 06:49:46AM -0800, J Robinson wrote:
> The actual complaint is that the spider is indexing
> pages it shouldn't.

Right -- I had this complaint once and it turned out to be a syntax
error in the robots.txt file.
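A classic example (hypothetical -- not necessarily what happened in
that case) is a stray blank line inside a record:

    User-agent: *

    Disallow: /private/

A blank line ends a record, so that Disallow belongs to no User-agent
record, and a strict parser may silently ignore it and fetch /private/
anyway.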


> I'll check out the 'skipped' debug flag -- is there
> another that actually shows urls being compared
> against the robots.txt contents?

The spider just uses LWP::RobotUA, which in turn uses WWW::RobotRules.
Both are widely used, so they should behave as expected.
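If you want to see for yourself what WWW::RobotRules makes of your
robots.txt, you can feed it to the module directly.  Untested sketch --
the agent name, robots.txt contents, and URLs below are placeholders,
so substitute your own:

    use WWW::RobotRules;

    # The name must match what your spider sends as its User-Agent.
    my $rules = WWW::RobotRules->new('swish-e spider');

    # Hand it the same robots.txt the server actually returns.
    my $robots_txt = join "\n",
        'User-agent: *',
        'Disallow: /private/';
    $rules->parse('http://example.com/robots.txt', $robots_txt);

    for my $url ( 'http://example.com/index.html',
                  'http://example.com/private/secret.html' ) {
        print "$url => ",
            ( $rules->allowed($url) ? 'allowed' : 'disallowed' ), "\n";
    }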

Try setting this in the spider:

    use LWP::Debug qw(+debug);

although you might get more output than you want if you are spidering
a lot of files.  I typically just hack away at the module and throw in
print statements to see what's happening.
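If you don't want to edit the installed module, something like this
(again, an untested sketch) wraps WWW::RobotRules::allowed() so that
every check gets logged, which shows exactly which URLs are being
compared against the rules:

    use WWW::RobotRules;

    # Replace allowed() with a version that logs each decision
    # before handing back the original result.
    {
        no warnings 'redefine';
        my $orig = \&WWW::RobotRules::allowed;
        *WWW::RobotRules::allowed = sub {
            my ( $self, $url ) = @_;
            my $ok = $orig->(@_);
            warn "robots check: $url => ",
                ( $ok ? 'allowed' : 'disallowed' ), "\n";
            return $ok;
        };
    }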

-- 
Bill Moseley
moseley@hank.org
