spidering an intranet that requires login

From: Bill Conlon <bill(at)not-real.tothept.com>
Date: Thu Aug 21 2003 - 21:41:53 GMT
I have an intranet application with which I want to use swish-e.

This app uses some cookies for persistent authentication and state 
management.  I note that spider.pl already handles cookies.

It seems to me that the way to spider the site is to start indexing at 
the login form, for example

http://myintranet.org/login.php?_function=checkpw&username=swishe&password=spider

assuming username/password = swishe/spider for the spider.

This would cause swish-e to log in and receive its cookies.  Thereafter it 
would use the cookies to navigate the rest of the site.
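
To make the intended flow concrete, here is a minimal Python sketch of 
"request the login URL once, then reuse the cookies".  It is only an 
illustration and has nothing to do with how spider.pl is implemented; the 
helper names build_login_url and make_session are hypothetical, and the 
URL and credentials are the example values above.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, build_opener

# Example login page from this message; not a real host.
LOGIN_PAGE = "http://myintranet.org/login.php"


def build_login_url(username, password):
    """Assemble the login URL with the query parameters used by the app."""
    query = urlencode({
        "_function": "checkpw",
        "username": username,
        "password": password,
    })
    return f"{LOGIN_PAGE}?{query}"


def make_session():
    """Return an opener that stores received cookies and resends them."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


login_url = build_login_url("swishe", "spider")
opener, jar = make_session()

# Calling opener.open(login_url) would perform the login.  Any cookies the
# server sets are kept in `jar` and sent on later opener.open(...) calls.
print(login_url)
```

No request is made until opener.open() is called, so the sketch can be 
read or run without access to the intranet.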

But this bombs out with

swish-e -S prog -c spider.config
Indexing Data Source: "External-Program"
Indexing "spider.pl"
sh: line 1: username=swishe: command not found
sh: line 1: .password=swishe: command not found
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'default'
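
The "command not found" lines look consistent with the shell treating 
each unquoted "&" in the URL as a command separator.  The Python sketch 
below illustrates that splitting with the standard shlex module.  It 
assumes the program name and its parameters reach sh as one command line 
(which the "sh:" prefix suggests but I have not confirmed), and the 
command line "spider.pl default <URL>" is itself an assumption used only 
for illustration.

```python
import shlex

URL = ("http://myintranet.org/login.php"
       "?_function=checkpw&username=swishe&password=spider")


def shell_tokens(command_line):
    """Split a command line roughly as a POSIX shell would.

    punctuation_chars=True makes operators such as & separate tokens;
    whitespace_split keeps characters like ':' inside a word.
    """
    lexer = shlex.shlex(command_line, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    return list(lexer)


# Unquoted: the URL is cut at each &, so "username=swishe" and
# "password=spider" end up as separate words after an operator.
unquoted = shell_tokens(f"spider.pl default {URL}")

# Quoted with shlex.quote: the URL survives as a single argument.
quoted = shell_tokens(f"spider.pl default {shlex.quote(URL)}")

print(unquoted)
print(quoted)
```

If that is what is happening, quoting the URL wherever the command line 
is assembled would be the thing to try, though where that quoting has to 
go depends on how swish-e passes the parameters along.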

I've tried some escape sequences without success.  Glancing through the 
beginning of spider.pl, it looks like the script starts with a base URI 
and, I presume, only allows search arguments after it.

So rather than mess with your script, it probably makes more sense to 
have a back door for the spider, so we start at:

http://myintranet.org/spider.php

This page does the login for the spider, returning the necessary cookies.  
Of course, an .htaccess restriction is needed on this page.

So, if I do it that way, will swish-e accept a 302 redirect from 
spider.php to index.html, or should spider.php return the contents of 
index.html itself?  The latter is a little more involved, since we don't 
want spider.php to appear in the index.

Any helpful hints?

Bill Conlon

To the Point
345 California Avenue Suite 2
Palo Alto, CA 94306

office: 650.327.2175
fax:    650.329.8335
mobile: 650.906.9929
e-mail: mailto:bill@tothept.com
web:    http://www.tothept.com
Received on Thu Aug 21 21:42:59 2003