Re: RE: Abstracting

From: Antonio Cisternino <cisterni(at)not-real.Di.Unipi.IT>
Date: Mon Jun 14 1999 - 15:53:36 GMT
> Date: Mon, 14 Jun 1999 00:58:07 -0700 (PDT)
> Reply-To: nhuillard@ghs.fr
> Originator: swish-e@sunsite.berkeley.edu
> Sender: swish-e@sunsite.berkeley.edu
> Precedence: bulk
> From: Nicolas Huillard <nhuillard@ghs.fr>
> X-Listprocessor-Version: 6.0c -- ListProcessor by Anastasios Kotsikonas
> Content-Type: text/plain; charset=unknown-8bit
> 
> Yes, this is interesting! There has been a discussion about abstracting on this mailing list, and someone said that it wasn't so slow to generate abstracts while generating the answer page of the query (at "run-time"). But modifying the Perl front-end script to generate abstracts is a lot of work. If the work is already done, so much the better for me.
> I agree with Ron when he says that abstracting should be independent of the retrieval method.
> The fact is that I use the filesystem retrieval method (to index only a well-known set of files), and it would be useful for me to have that feature...

This is the code that does the abstracting. It can be inserted in the
spider or run as a separate helper.
I've used $response->header and $response->content to establish the
MIME type and the content.
The abstract is stored in a db (a hash table). The name of the db
must be stored in the environment variable SWISH_ABSTRACT.
The code works only on Unix (via the standard DB_File package
provided with the Perl distribution).
Using DB_File, you can look up an abstract given its URL simply by
tying a Perl hash to the db (the tie call) and then accessing it like a
standard hash. If you want to see the result, you can try the following URL:

http://www.di.unipi.it/search

You can try as query

cisternino

and you get 19 hits.
In our site search I've used different indexes to allow selective search, but
only one abstract db that contains the abstracts for all the URLs in the
various indexes.
As you can see, the abstracting procedure is, for the moment, only a filter
that takes the first 200 chars after the body tag, removing tags and multiple
spaces.
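
To make the filter concrete, here is a small worked sketch of it as a
standalone sub. The name make_abstract and its $max parameter are only
illustrative, and unlike the snippet in this message it appends "..." only
when the text was actually cut:

```perl
use strict;
use warnings;

# Hypothetical helper: first $max chars after the body tag,
# with tags removed and whitespace runs collapsed.
sub make_abstract {
    my ($html, $max) = @_;
    $max = 200 unless defined $max;

    # Keep only what follows the opening body tag, if there is one.
    return unless $html =~ /<body.*?>(.*)/si;
    my $text = $1;

    $text =~ s/<.*?>//sg;             # remove tags
    $text =~ s/(?:\s|&nbsp;)+/ /g;    # collapse whitespace and &nbsp;
    $text =~ s/^\s+|\s+$//g;          # trim both ends

    return length($text) > $max ? substr($text, 0, $max) . "..." : $text;
}

my $page = "<html><body bgcolor=\"white\"><h1>Home</h1>\n"
         . "<p>Welcome&nbsp;to   the site.</p></body></html>";
print make_abstract($page), "\n";    # prints "Home Welcome to the site."
```

A page with no body tag yields no abstract at all, which matches the
behaviour of the filter described above.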

-- Antonio

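use DB_File;    # also exports the O_* flags and $DB_HASH

### For abstract
# Link each URL to its abstract in the db.
# $response and $url come from the surrounding spider code.

# Accept "text/html" with or without parameters such as a charset
if( ($response->header("content-type") || "") =~ m{^text/html\b}i ) {
    my $cnt = $response->content();

    if ($cnt =~ /<body.*?>(.*)/si) {
        # Build the abstract from everything after the body tag
        $cnt = $1;

        $cnt =~ s/<.*?>//sg;              # remove tags
        $cnt =~ s/(?:\s|&nbsp;)+/ /g;     # collapse whitespace, newlines included

        $cnt = substr($cnt, 0, 200) . "...";

        my $abstracts_db = $ENV{SWISH_ABSTRACT};
        my %abstracts;

        if ($abstracts_db) {
            # Create the db the first time it is used
            my $mode = O_RDWR;
            $mode |= O_CREAT if (! (-e $abstracts_db));

            tie %abstracts, 'DB_File', $abstracts_db, $mode, 0664, $DB_HASH
                or die("Couldn't open DB_File `$abstracts_db': $!");

            $abstracts{$url} = $cnt;
            untie %abstracts;             # flush and close the db
        }
    }
}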
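
The reading side mentioned above (tie a hash to the db, then look the
abstract up by URL) is not shown in this message. A minimal sketch of how a
search front end might do it follows; the sub name lookup_abstract and the
example URL are hypothetical, and it assumes the db was written as a
DB_File hash keyed by URL:

```perl
use strict;
use warnings;
use DB_File;    # also exports the O_* flags and $DB_HASH

# Hypothetical helper: return the abstract stored for $url, or undef
# if the db cannot be opened or holds no entry for that URL.
sub lookup_abstract {
    my ($db_path, $url) = @_;
    my %abstracts;

    # Read-only tie; this fails if the db does not exist yet.
    tie %abstracts, 'DB_File', $db_path, O_RDONLY, 0664, $DB_HASH
        or return;

    my $abstract = $abstracts{$url};
    untie %abstracts;
    return $abstract;
}

my $db_path = $ENV{SWISH_ABSTRACT};
if (defined $db_path) {
    # Placeholder URL; use the URL of a hit from the result list.
    my $abstract = lookup_abstract($db_path, 'http://www.example.org/index.html');
    print defined $abstract ? "$abstract\n" : "(no abstract stored)\n";
}
```

Opening the db read-only keeps the front end from creating or modifying
it, so only the spider ever writes abstracts.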

### By Antonio Cisternino
Received on Mon Jun 14 08:50:20 1999