Re: Extracting descriptions

From: Jacques Delsemme <jacques(at)not-real.cats.UCSC.EDU>
Date: Fri Dec 04 1998 - 21:24:18 GMT

Thanks for doing this.  I am testing your routine at one of our sites:

and here are my experiences with it:

1. I had to increase the number of characters read to 2048; otherwise
the extract often disappeared altogether once the various tags at the
start of a document (meta tags and other proprietary tags inserted
automatically by some web editors) had been eliminated.

2. By the same token, I've decreased the number of words returned to no 
more than 50.

3. I've inserted the line:

	s/<!--.*?-->//gs;                   # remove comment tags (non-greedy; /s spans newlines)

to remove comment tags.  I do this first.
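
Points 1 to 3 can be sketched roughly as follows.  This is only an
illustration, not the routine being tested: the name
extract_description, its parameters, and the simple tag-stripping
patterns are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper (not the routine under discussion): builds a short
# plain-text extract from the start of an HTML document.
sub extract_description {
    my ($html, $max_chars, $max_words) = @_;
    $max_chars = 2048 unless defined $max_chars;   # point 1: characters read
    $max_words = 50   unless defined $max_words;   # point 2: words returned

    # Only look at the first $max_chars characters of the document.
    my $text = substr($html, 0, $max_chars);

    # Point 3: remove comments first.  Non-greedy, so two comments do not
    # swallow the text between them; /s lets a comment span several lines.
    $text =~ s/<!--.*?-->//gs;

    # Assumed simplification: strip the remaining tags with a regex, then
    # drop a tag that was cut off by the character limit.
    $text =~ s/<[^>]*>//g;
    $text =~ s/<[^>]*$//;

    # Split on whitespace and keep at most $max_words words.
    my @words = split ' ', $text;
    splice(@words, $max_words) if @words > $max_words;
    return join ' ', @words;
}

# Usage example with a made-up document fragment.
print extract_description("<!-- a -->Hello <b>world</b><!-- b\nc --> again"), "\n";
```

A regex is enough for a rough extract like this, but it will mis-handle
unusual markup (for instance a ">" inside an attribute value).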

4. You are using the "description" meta tag to extract the description 
of the page.  Is this use universal?  I'm curious to learn whether 
there is a well-defined standard (I plead ignorance about this), or 
whether there is a variety of meta tags in use (e.g. "abstract", 

