Skip to main content.
home | support | download

Back to List Archive

Extracting descriptions (Was: Switch-e and <META DESCRIPTION=...>

From: Paul J. Lucas <pjl(at)not-real.ptolemy.arc.nasa.gov>
Date: Thu Dec 03 1998 - 23:07:20 GMT
On Tue, 1 Dec 1998, Brian Rankin wrote:

> The Perl script pulls-out the first 300 non-html characters from each
> document that matches the query, and uses that as the desription.  This
> happens interactively for each search result.  The software all resides on an
> old Sparc-10 with 48MB RAM (about as fast as a 486-66).  Not too shabby for a
> Perl script...

	I'd like to include a Perl library function to do just this in
	the next release of SWISH++ so people don't have to keep
	reinventing the wheel.  Here's what I've come up with:

-------------------------------------------------------------------------------
sub extract_description {
	my $path_name = shift;
	open( FILE, $path_name ) or die "can not open $path_name\n";
	read( FILE, $_, 1024 );			# pull in chunk o' text
	close FILE;
	s/\s+/ /g;				# turn into one big long string

	if ( $path_name =~ /\.s?html?$/ ) {	# file is HTML
		return $3 if /<META\s[^>]*?NAME\s*=\s*(['"])?description\1?\s[^>]*?CONTENT\s*=\s*(['"])([^>]+)\2[^>]*?>/i;
		return $2 if /<META\s[^>]*?CONTENT\s*=\s*(['"])([^>]+)\1\s[^>]*?NAME\s*=\s*(?:['"])?description[^>]*?>/i;
		s!<(SCRIPT|STYLE|TITLE).*?>.*?</\1>!!gi;
		s!<(?:SCRIPT|STYLE|TITLE).*$!!i;
		s!<.*?>!!g;			# zap all other HTML tags
		s!<.*$!!g;			# zap partial trailing tag
	}
	s/^\s+//;				# zap leading whitespace
	s/^((?:[^\s]+\s+){100}).*/$1/;		# return at most 100 words
	return $_;
}
-------------------------------------------------------------------------------

	For HTML files, if it contains a META description, take that
	string and be done with it; otherwise, strip out all text
	between SCRIPT, STYLE, and TITLE begin/end tags, and then strip
	all remaining tags.

	At this point for an HTML file (or initially for plan text
	files), throw away all but the first 100 words (at most).

	Sound reasonable?

	- Paul
Received on Thu Dec 3 15:08:10 1998