Re: [swish-e] Passing username and password when spidering restricted websites

From: Troy Wical <troy(at)not-real.wical.com>
Date: Wed Jun 16 2010 - 05:07:49 GMT
On Jun 15, 2010, at 7:30 PM, Peter Karman wrote:

> Troy Wical wrote on 6/15/10 9:09 AM:
>> Had my down time, now getting back into this again. This time it's for the workplace. We have several internal documentation sites, and searching all of them individually can be a pain. So I decided to spider all of them and make them all searchable via swish.cgi. I have it working fairly well so far, but am having a hard time spidering sites that require authentication. All the sites are being indexed individually, and this is the basic conf that I am using:
>> 
>> ###############################
>> 
>> IndexDir spider.pl
>> SwishProgParameters default http://restricted-website.com/dir/index.php 
>> IndexFile /path/to/indexes/restricted-website.index
>> StoreDescription HTML* <body> 200000
>> 
> 
> Instead of "default" above you need to create a spider config file and put
> "credentials" in it:
> 
> http://swish-e.org/docs/spider.html#credentials


Gave that a shot, but no luck. Below is the config I am working with.

###########################
@servers = (
        {
            base_url    => 'http://restricted-website.com',
            email       => 'my@email.com',
            delay_sec   => '0',
            credentials => 'username:password',
        },
    );
###########################

When I run "./spider.pl spider.config > output.txt" I get the following:

###########################
Use of uninitialized value in sprintf at /usr/lib/swish-e/spider.pl line 38.
Use of uninitialized value in sprintf at /usr/lib/swish-e/spider.pl line 38.
/usr/lib/swish-e/spider.pl: Reading parameters from 'spider.config'
Warning: document 'http://restricted-website.com' has no content

Summary for: http://restricted-website.com
Connection: Close: 1  (1.0/sec)
      Total Bytes: 1  (1.0/sec)
       Total Docs: 1  (1.0/sec)
      Unique URLs: 1  (1.0/sec)
###########################

Now, there are two things that I have noticed. When I log in to this website via a browser, the URL ends in dashboard.action, as opposed to something more common like .php. Also, the pop-up login window is handled by a second URL that takes care of all the authentication. I'm wondering whether this is throwing swish-e a curve ball when it comes to logging in.
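
One way to test that theory is a spider config that shows what the server actually sends back. The sketch below is untested and rests on assumptions: the option names use_cookies and debug, and the DEBUG_URL and DEBUG_HEADERS constants, are as I understand them from the spider.pl documentation and should be checked against the installed version. The base_url path is borrowed from the swish-e conf quoted above. Note that credentials only applies to HTTP Basic authentication, so it would have no effect if the site logs users in through a form and a session cookie.

```perl
# Sketch only; anything marked "assumption" is not from the original post.
@servers = (
    {
        # Start at the page named in the swish-e conf rather than the bare host.
        base_url    => 'http://restricted-website.com/dir/index.php',
        email       => 'my@email.com',
        delay_sec   => 0,

        # Only used when the server answers with a 401 Basic auth challenge.
        credentials => 'username:password',

        # Assumption: keeps a cookie jar across requests, which a
        # session-based login would need.
        use_cookies => 1,

        # Assumption: prints each URL and its response headers, so a 401
        # challenge can be told apart from a redirect to a login page.
        debug       => DEBUG_URL | DEBUG_HEADERS,
    },
);
```

If the headers show a redirect to the second login URL instead of a 401, the site is most likely using form-based login, and credentials alone would not be enough.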

Troy
Received on Wed Jun 16 01:07:52 2010