On Tue, Nov 23, 2004 at 09:52:38AM -0800, Jon Sorensen wrote:
> I have been trying to spider a site like so:
>
> my %serverA = (
>     base_url => 'http://www.generac-portables.com/pressure_washers/pressure_washer.cfm?id=183&use=&price=&psi=&order=1',
>     keep_alive => 0,
>     test_url => sub {
>         my $uri = shift;
>         if ($uri->path =~ /pressure_washer\.cfm/){
>             return 1 ;}
>         else {return 0;}
>     },
>     use_md5 => 1,
>     max_files => 30,
> );
>
> @servers = ( \%serverA, );
>
> #######################################
>
> In the output, swish was getting hung up on "&psi=" in the query string.
> It was converting it to the character entity of the Greek alphabet "Psi" (ψ)
> and getting caught in an infinite loop:
Interesting. It's hard to see what's going on; it seems the list
server doesn't deal with quoted-printable mail.
I also tried spidering and I didn't see any problems with psi.
my %serverA = (
    base_url => 'http://www.generac-portables.com/pressure_washers/pressure_washer.cfm?id=183&use=&psi=&order=1',
    keep_alive => 0,
    email => 'moseley@hank.org',
    test_url => sub {
        my $uri = shift;
        if ($uri->path =~ /pressure_washer\.cfm/){
            return 1 ;}
        else {return 1;}
    },
    use_md5 => 1,
    delay_sec => 0,
    max_files => 30,
);
@servers = ( \%serverA, );
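Note that both branches of this test_url return 1, so the callback accepts every URL the spider finds; that is why the listing includes pages other than pressure_washer.cfm. A sketch of a callback that keeps the restriction from the original config, written as a fragment for the same hash (it only uses $uri->path and the test_url key already shown in the configs here):

```perl
# Fragment for spider.conf: only follow URLs whose path names pressure_washer.cfm.
# Returns 1 to spider the URL and 0 to skip it, as in the original config.
test_url => sub {
    my $uri = shift;
    return $uri->path =~ /pressure_washer\.cfm/ ? 1 : 0;
},
```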
moseley@bumby:~$ /usr/local/lib/swish-e/spider.pl spider.conf >xx
/usr/local/lib/swish-e/spider.pl: Reading parameters from 'spider.conf'
moseley@bumby:~$ fgrep Path-Name xx
Path-Name: http://www.generac-portables.com/pressure_washers/pressure_washer.cfm?id=183&use=&psi=&order=1
Path-Name: http://www.generac-portables.com/generators/index.cfm
Path-Name: http://www.generac-portables.com/pressure_washers/index.cfm
Path-Name: http://www.generac-portables.com/where_to_buy/index.cfm
Path-Name: http://www.generac-portables.com/service_support/faq/index.cfm
Path-Name: http://www.generac-portables.com/index.cfm
Path-Name: http://www.generac-portables.com/pressure_washers/pw_basics.cfm
Path-Name: http://www.generac-portables.com/pressure_washers/pw_project_tips.cfm
Path-Name: http://www.generac-portables.com/pressure_washers/glossary.cfm
Path-Name: http://www.generac-portables.com/pressure_washers/pressure_washer.cfm?order=2&id=214&use=&price=&ppsi=
moseley@bumby:~$ swish-e -i stdin -S prog -v0 < xx
moseley@bumby:~$ swish-e -w not dkdkdkd -x '%p\n'
# SWISH format: 2.5.2
# Search words: not dkdkdkd
# Removed stopwords:
# Number of hits: 10
# Search time: 0.017 seconds
# Run time: 0.036 seconds
http://www.generac-portables.com/pressure_washers/pressure_washer.cfm?id=183&use=&psi=&order=1
http://www.generac-portables.com/pressure_washers/glossary.cfm
http://www.generac-portables.com/pressure_washers/pw_project_tips.cfm
http://www.generac-portables.com/pressure_washers/pw_basics.cfm
http://www.generac-portables.com/index.cfm
http://www.generac-portables.com/service_support/faq/index.cfm
http://www.generac-portables.com/where_to_buy/index.cfm
http://www.generac-portables.com/pressure_washers/index.cfm
http://www.generac-portables.com/generators/index.cfm
http://www.generac-portables.com/pressure_washers/pressure_washer.cfm?order=2&id=214&use=&price=&ppsi=
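One possible explanation for the original report (a guess, not something confirmed in this thread) is that an unescaped "&psi=" in an href gets treated as the entity for the Greek letter psi when the link text is entity-decoded. A minimal sketch for checking this outside the spider, assuming the HTML::Entities module is installed; the relative href is copied from the base_url above, and whether it reports a change may depend on the installed module version:

```perl
#!/usr/bin/perl
# Sketch: does entity decoding alter a query string that contains "&psi="?
# Assumes HTML::Entities (from the HTML-Parser distribution) is installed.
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# The href as it might appear, unescaped, in the page source.
my $href = 'pressure_washer.cfm?id=183&use=&psi=&order=1';

# In non-void context decode_entities() returns a decoded copy
# and leaves $href itself untouched.
my $decoded = decode_entities($href);

# Emit UTF-8 so a decoded Greek letter prints cleanly.
binmode STDOUT, ':encoding(UTF-8)';
if ($decoded eq $href) {
    print "unchanged: $decoded\n";
} else {
    print "changed:   $decoded\n";
}
```

If the string does come back changed, writing the ampersands as &amp; in the page's links should avoid the ambiguity.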
--
Bill Moseley
moseley@hank.org
Received on Tue Nov 23 10:22:55 2004