Re: Swish-e wandering off on its own

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Fri Feb 22 2002 - 22:41:41 GMT
At 10:32 AM 02/22/02 -0800, Darryl Friesen wrote:
>I've run across an interesting problem.  I'm using the spider.pl (with "-S
>prog" of course) to index our Intranet, which seems to work fine except
>swish-e happily wanders off and indexes our main library web pages as well.
>Our Intranet runs on the SSL port of the same machine (i.e. Intranet URLs
>are all https://library.usask.ca/ and our public pages are
>http://library.usask.ca).

Argh!  A week or so ago (for some reason I can't remember) I changed to
using the URI->host_port call, which breaks that.  Sorry.


>Is there a quick and dirty way to stop this?  I have a common set of
>callback functions for test_url and filter_content that I use for both the
>Intranet and our main server (and a few others) so I can't just "return 0"
>if the URL does not start with "https".

Why not?  You could say something like:

     return 0 if $uri->scheme eq 'https' 
              && $uri->canonical->authority eq 'library.usask.ca';

That's not tested, but you get the idea.
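
Since the same callbacks are shared between servers, a more general variant is to compare each URL against the base URL of whichever server is being spidered, rather than hard-coding one host. The following is only a rough sketch: it assumes spider.pl calls the callback as test_url( $uri, $server ) and that $server->{base_url} holds the base_url from the config. same_site is a hypothetical helper name, not something spider.pl provides.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper (not part of spider.pl): true only when $uri has
# the same scheme and the same host:port as $base.
sub same_site {
    my ( $uri, $base ) = @_;
    $uri  = URI->new($uri)->canonical;
    $base = URI->new($base)->canonical;

    # Schemes such as mailto: have no host_port method; reject them.
    return 0 unless $uri->can('host_port') && $base->can('host_port');

    return ( $uri->scheme eq $base->scheme
          && $uri->host_port eq $base->host_port ) ? 1 : 0;
}

# Shared callback. Assumes spider.pl calls it as test_url( $uri, $server )
# and that $server->{base_url} is the base_url from the config.
sub test_url {
    my ( $uri, $server ) = @_;
    return same_site( $uri, $server->{base_url} );
}
```

With something like that, the Intranet config (base_url of https://library.usask.ca/) should skip the http:// pages, and the main server config should skip the https:// ones, without either callback needing to know which host it is on.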

>I thought spider.pl would treat the URLs as being different actually, but it
>looks as if it's comparing host, not scheme/port (although I haven't really
>looked at the code; maybe I should).

Yes, it should.  It's currently only looking at the host:port.

I'll get an update out in the next 24 hours, but the test_url function is a
good place to do that kind of check.  That's why I put those callback
functions in -- so people could fix my bugs ;)

Thanks for the report!





-- 
Bill Moseley
mailto:moseley@hank.org
Received on Fri Feb 22 22:42:34 2002