My recollection is that the spider Perl script (spider.pl) provides for
HTTP Basic or Digest authentication. If you want to spider a web
application that requires login credentials, then in addition to using
cookies, you need to provide credentials to the application itself. I
found it easiest to create a back door in my app through which the
swish-e spider could authenticate via GET; human users still enter
their credentials in the form and POST it.
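A rough sketch of the kind of spider config I mean, in case it helps. The
login path and its query parameters are made up for illustration (your
app would have to expose something like them), and I have not tested this
against your site; use_cookies is the spider.pl option that keeps the
session cookie between requests.

```perl
# spider.config: sketch only.
# ASSUMPTION: the app accepts a GET login at /login with "user" and
# "pass" query parameters. That endpoint is hypothetical; it stands in
# for whatever back door you add to your own application.
@servers = (
    {
        # Start at the GET login URL so the first request sets up the session
        base_url    => 'http://restricted-website.com/login?user=spider&pass=secret',
        email       => 'my@email.com',
        delay_sec   => 0,
        # Store and resend cookies so later requests stay logged in
        use_cookies => 1,
    },
);
```

You would run it the same way as before, "./spider.pl spider.config", and
check that the summary reports more than one document.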
On Jun 15, 2010, at 10:07 PM, Troy Wical wrote:
> On Jun 15, 2010, at 7:30 PM, Peter Karman wrote:
>
>> Troy Wical wrote on 6/15/10 9:09 AM:
>>> Had my down time, now getting back into this again. This time it's
>>> for the workplace. We have several internal documentation sites,
>>> and searching all of them individually can be a pain. So I decided to
>>> spider all of them and make them all searchable via swish.cgi. I
>>> have it working fairly well so far, but am having a hard time
>>> spidering sites that require authentication. All the sites are
>>> being indexed individually, and this is the basic conf that I am
>>> using:
>>>
>>> ###############################
>>>
>>> IndexDir spider.pl
>>> SwishProgParameters default http://restricted-website.com/dir/index.php
>>> IndexFile /path/to/indexes/restricted-website.index
>>> StoreDescription HTML* <body> 200000
>>>
>>
>> Instead of "default" above you need to create a spider config file
>> and put "credentials" in it:
>>
>> http://swish-e.org/docs/spider.html#credentials
>
>
> Gave that a shot, but no luck. Below is the config I am working with.
>
> ###########################
> @servers = (
> {
> base_url => 'http://restricted-website.com',
> email => 'my@email.com',
> delay_sec => '0',
> credentials => 'username:password',
> },
> );
> ###########################
>
> When I run "./spider.pl spider.config > output.txt" I get the
> following:
>
> ###########################
> Use of uninitialized value in sprintf at /usr/lib/swish-e/spider.pl line 38.
> Use of uninitialized value in sprintf at /usr/lib/swish-e/spider.pl line 38.
> /usr/lib/swish-e/spider.pl: Reading parameters from 'spider.config'
> Warning: document 'http://restricted-website.com' has no content
>
> Summary for: http://restricted-website.com
> Connection: Close: 1 (1.0/sec)
> Total Bytes: 1 (1.0/sec)
> Total Docs: 1 (1.0/sec)
> Unique URLs: 1 (1.0/sec)
> ###########################
>
> Now, there are two things that I have noticed. When I log in to this
> website via browser, the URL ends in dashboard.action, as opposed to
> something more common like .php. Also, the pop-up window to log in
> is handled by a second URL that takes care of all the
> authentication. I'm wondering if this isn't throwing a curve ball to
> swish-e when it comes to logging in.
>
> Troy
> _______________________________________________
> Users mailing list
> Users@lists.swish-e.org
> http://lists.swish-e.org/listinfo/users
Received on Fri Jun 18 02:30:16 2010