Language and Swishspider

From: Antonio Cisternino <cisterni(at)not-real.Di.Unipi.IT>
Date: Wed May 26 1999 - 16:42:29 GMT
Is it possible to replace swishspider with the following version?
I've added just three lines to support language requests to web servers.
Our site uses the MultiViews option of Apache, and the server serves a
page depending on the language requested (we support it and en).
I want to use swish over HTTP, but I want to build two indexes: one for
English pages and the other for Italian pages.
So I've added the following lines to the swishspider helper:

my $language = $ENV{SWISH_LANG};
$language ||= "en"; # These two lines set the language (en is the default).
my $header_lang = new HTTP::Headers(Accept_language => $language);

These lines choose the language by looking at the SWISH_LANG environment
variable. If the variable is not defined, "en" is assumed. An additional
header for the HTTP::Request is then created.
Finally, I've changed the line that builds the HTTP request as follows:

my $request = new HTTP::Request( "GET", $url, $header_lang );
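
To check that the header really ends up on the request, something like the following sketch can be run on its own. It is not part of swishspider: build_request() is a hypothetical helper name, and http://www.example.com/ is a placeholder URL. Nothing is fetched; the request is only built and printed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Headers;
use HTTP::Request;

# Build a GET request carrying an Accept-Language header, mirroring the
# three added lines. build_request() is a hypothetical name used only here.
sub build_request {
    my ($url, $language) = @_;
    $language ||= "en";    # fall back to English when nothing is set
    my $headers = HTTP::Headers->new( Accept_Language => $language );
    return HTTP::Request->new( "GET", $url, $headers );
}

# Placeholder URL (an assumption); the request is printed, not sent.
my $request = build_request( "http://www.example.com/", $ENV{SWISH_LANG} );
print $request->as_string;
```

Running it with SWISH_LANG set to it and then with SWISH_LANG unset should show the Accept-Language header changing accordingly.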

Now the language the spider requests can be controlled through the
environment.
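
For building the two indexes, one possible approach is a small wrapper that runs the indexing command once per language with SWISH_LANG set each time. This is only a sketch under assumptions: the %LANG% placeholder and the helper names command_for() and run_all() are made up for illustration, and the indexing command itself is taken from the command line rather than assumed here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Languages to index; "it" and "en" are the two the site serves.
my @languages = ( "it", "en" );

# Replace the hypothetical %LANG% placeholder in each argument, so that
# each pass can point at its own configuration or index file.
sub command_for {
    my ($language, @template) = @_;
    return map { my $arg = $_; $arg =~ s/%LANG%/$language/g; $arg } @template;
}

# Run the command once per language. local() restores the old value of
# SWISH_LANG after each pass; the child process inherits the new value.
sub run_all {
    my (@template) = @_;
    for my $language (@languages) {
        local $ENV{SWISH_LANG} = $language;
        my @command = command_for( $language, @template );
        system(@command) == 0
            or die "Command failed for language $language: @command\n";
    }
}

# The indexing command is whatever is passed on the command line.
run_all(@ARGV) if @ARGV;
```

The wrapper would be called with the usual indexing command as its arguments, writing %LANG% wherever the per-language configuration or index file name should differ.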

-- Antonio

Swishspider:

#!/usr/bin/perl

use LWP::UserAgent;
use HTTP::Headers;
use LWP::RobotUA;
use HTTP::Request;
use HTTP::Status;
use HTML::LinkExtor;

if (scalar(@ARGV) != 2) {
    print STDERR "Usage: SwishSpider localpath url\n";
    exit(1);
}

my $language = $ENV{SWISH_LANG};
$language ||= "en";

my $header_lang = new HTTP::Headers(Accept_language => $language);

my $ua = new LWP::UserAgent;
$ua->agent( "SwishSpider" );
$ua->from( "ron\@ckm.ucsf.edu" );

my $localpath = shift;
my $url = shift;

my $request = new HTTP::Request( "GET", $url, $header_lang );
my $response = $ua->simple_request( $request );

#
# Write out important meta-data.  This includes the HTTP code.  Depending on the
# code, we write out other data.  Redirects have the location printed, everything
# else gets the content-type.
#
open( RESP, ">$localpath.response" ) || die( "Could not open response file $localpath.response" );
print RESP $response->code() . "\n";
if( $response->code() == RC_OK ) {
    print RESP $response->header( "content-type" ) . "\n";
} elsif( $response->is_redirect() ) {
    print RESP $response->header( "location" ) . "\n";
}
close( RESP );

#
# Write out the actual data assuming the retrieval was successful.  Also, if
# we have actual data and it's of type text/html, write out all the links it
# refers to
#
if( $response->code() == RC_OK ) {
    my $contents = $response->content();

    open( CONTENTS, ">$localpath.contents" ) || die( "Could not open contents file $localpath.contents\n" );
    print CONTENTS $contents;
    close( CONTENTS );

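    # content_type() drops parameters such as "; charset=...", which would
    # otherwise make the comparison with "text/html" fail.
    if( $response->content_type() eq "text/html" ) {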
	open( LINKS, ">$localpath.links" ) || die( "Could not open links file $localpath.links\n" );
	my $p = HTML::LinkExtor->new( \&linkcb, $url );
	$p->parse( $contents );

	close( LINKS );
    }
}


sub linkcb {
    my($tag, %links) = @_;
    if (($tag eq "a") && ($links{"href"})) {
	my $link = $links{"href"};

	#
	# Remove fragments
	#
	$link =~ s/(.*)#.*/$1/;
	
	#
	# Remove ../  This is important because the abs() function
	# can leave these in and cause never ending loops.
	#
	$link =~ s/\.\.\///g;
	
	print LINKS "$link\n";
    }
}
Received on Wed May 26 09:39:32 1999