Re: Unable to spider certain pages, md5 problem

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Wed Sep 08 2004 - 15:07:02 GMT

On Tue, Sep 07, 2004 at 11:47:11PM -0700, Tim Hartley wrote:
> 
> Hiya,

Howdy!

> Spider.pl is not accessing certain ASP pages

Spider.pl has some debugging code you can turn on to see why a page
isn't spidered.  You can enable debugging from either the spider's
config file or via environment variables.

Spider.pl will print a message to stderr when a link is skipped, with
the reason why.  Other possibilities are that the page is not linked
from anywhere, or that it's only linked via JavaScript and not a
normal HTML link.

Debugging should not be hard.  First set base_url to the page in
question and see what happens.  If that works, then point base_url to
the page that should link to that page and see what happens.  You can
run the spider like:

   /path/to/spider.pl /path/to/spider/config > output.txt

to capture the output without using swish.  Oh, I assume since your
mailer doesn't wrap text that you are using Windows, so in that case
you may have to run

  perl /path/to/spider.pl /path/to/spider/config > output.txt
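
Building on the commands above, here is a rough sketch of a debugging
run from a Bourne-type shell.  It assumes the SPIDER_DEBUG environment
variable takes a comma-separated list of options and that "skipped"
and "url" are among them, so check the spider.pl docs for the exact
names.  The function name, the PERL/SPIDER/CONFIG variables and the
debug.txt file name are made up for illustration.

```shell
# Placeholders: point these at your own perl, spider.pl and config file.
PERL=${PERL:-perl}
SPIDER=${SPIDER:-/path/to/spider.pl}
CONFIG=${CONFIG:-/path/to/spider/config}

# Hypothetical wrapper: run the spider with debugging turned on,
# keeping the fetched documents (stdout) apart from the debug
# messages (stderr).
spider_debug_run() {
    # Assumed option names; the variable applies to this one command only.
    SPIDER_DEBUG=skipped,url "$PERL" "$SPIDER" "$CONFIG" \
        > output.txt 2> debug.txt
}

# Usage:
#   spider_debug_run
#   grep -i skip debug.txt    # look for the page that was not spidered
```

On Windows cmd.exe the variable would have to be set on its own line
first (set SPIDER_DEBUG=...) before running perl.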


> Due to the layout of the website, some articles and areas are
> duplicated in more than one area of the site. Some examples are: 

The spider compares URLs to look for duplicates, but when more than
one URL points to the same resource you can use MD5 checksums of the
content to try to find duplicates.

Another way to go is to normalize your URLs, which you can do in a
test_url() callback.  That is, if you have author=John_Doe and
author=john_doe pointing to the same page, then you might decide to
normalize everything to lowercase.  If you can do that it will
likely be slightly faster than calculating MD5s for all the content.
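
The real normalization belongs in the spider's test_url() callback,
which is Perl.  As a quick preview of the effect, though, something
like the following shell sketch could be run over a list of URLs (for
example ones pulled from the spider's debug output) to see how many
collapse into one after lowercasing.  The function name and the
example URLs are invented, and lowercasing the whole URL is only safe
if your server treats paths and parameters case-insensitively.

```shell
# Illustration only: lowercase every URL read from stdin and drop the
# repeats, so URLs that differ only by case collapse into one line.
normalize_urls() {
    tr '[:upper:]' '[:lower:]' | sort -u
}

# Usage:
#   printf '%s\n' 'http://server/list.asp?author=John_Doe' \
#                 'http://server/list.asp?author=john_doe' | normalize_urls
```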

If you use MD5 checking in the spider then it will report the number
of MD5 matches found at the end of spidering.  Plus, if you have the
spider log skipped URLs, then each one will be printed as it is
skipped.

If two docs match when you think they shouldn't, or don't match when
you think they should, then you should be able to use the LWP utility
program "GET" to fetch the docs, compare them, and see why they do or
don't match.

   GET http://server/path/to/doc.html

I'm not sure what's available on Windows but you might look for
programs called md5sum or diff.
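
Where md5sum and diff are available, the comparison might look
roughly like this.  It is only a sketch: compare_docs is a made-up
helper, the file names are placeholders, and it assumes that a
byte-for-byte comparison of the saved pages is close enough to what
the spider's MD5 check sees.

```shell
# Hypothetical helper: say whether two saved documents are
# byte-for-byte identical, and show the differing lines if not.
# Returns 0 when they match and 1 when they differ.
compare_docs() {
    sum1=$(md5sum < "$1")
    sum2=$(md5sum < "$2")
    if [ "$sum1" = "$sum2" ]; then
        echo "same"
        return 0
    fi
    echo "different"
    diff "$1" "$2"
    return 1
}

# Usage (the second URL is a placeholder for the suspected duplicate):
#   GET http://server/path/to/doc.html   > a.html
#   GET http://server/path/to/other.html > b.html
#   compare_docs a.html b.html
```

Pages that look alike in a browser can still differ by a timestamp,
a session ID or a bit of whitespace, and the diff output is usually
the quickest way to spot that.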


-- 
Bill Moseley
moseley@hank.org

Received on Wed Sep 8 08:07:40 2004