Re: [SWISH-E:279] Swish comments

From: Paul J. Lucas <pjl(at)not-real.ptolemy.arc.nasa.gov>
Date: Thu May 07 1998 - 07:30:07 GMT

On Wed, 6 May 1998, Brendan Jones wrote:

> I think providing context around search terms is unnecessary - quoting
> the first 50 or 100 words from the document is, I think, quite sufficient.
> Context becomes a difficult thing when you are doing a search on more
> than one term.  Do you provide context around all search terms?

	And why not?

> Grep is unsatisfactory and leads to disconnected output.  The title of the
> document and the opening lines generally provide enough context information.

	This also gets "messy" in that, if you want to do a good job,
	you not only have to strip out the HTML, but also anything
	between <SCRIPT>...</SCRIPT> since HTML files that contain
	scripts usually have them at the beginning.
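
To illustrate the stripping step described above, here is a minimal Python sketch. It is not code from SWISH-E or SWISH++; the function name `excerpt` and the 50-word default are assumptions for illustration. It also uses regular expressions, which only approximate HTML parsing, so a real parser would be more robust on malformed pages.

```python
import re
from html import unescape

# Hypothetical helper, not taken from SWISH-E or SWISH++.
# Regexes are an approximation; malformed HTML may need a real parser.
_SCRIPT_RE = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)
_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
_TAG_RE = re.compile(r"<[^>]+>")


def excerpt(html_text, max_words=50):
    """Return the first max_words words of the visible text."""
    # Remove whole <SCRIPT>...</SCRIPT> blocks first, so script source
    # near the top of the file does not end up in the excerpt.
    text = _SCRIPT_RE.sub(" ", html_text)
    # Remove comments, then any remaining tags.
    text = _COMMENT_RE.sub(" ", text)
    text = _TAG_RE.sub(" ", text)
    # Decode entities such as &amp; and collapse whitespace.
    words = unescape(text).split()
    return " ".join(words[:max_words])
```

Script blocks are removed before the generic tag pass because otherwise only the `<SCRIPT>` tags themselves would be stripped and the script body would remain as "text".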

> One of the few criticisms I had with swish seems to be addressed with
> swish-e - that is, stopwords shouldn't invalidate an otherwise possible
> search (i.e. there are other search terms in the query which are indexed).

	SWISH++ ignores stopwords in search strings.
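
As a rough sketch of what ignoring stopwords in a search string can look like (in Python, not SWISH++'s actual code; the stopword list and the name `query_terms` are made up for illustration), the stopwords are simply dropped from the query before lookup, so the remaining terms can still match:

```python
# Illustrative stopword list; an assumption, not the list SWISH++ uses.
STOPWORDS = {"a", "an", "of", "the", "to"}


def query_terms(query, stopwords=STOPWORDS):
    """Split a query into lowercase words and drop any stopwords."""
    return [w for w in query.lower().split() if w not in stopwords]
```

For example, `query_terms("the history of NASA")` keeps only `history` and `nasa`, so the stopwords cannot invalidate the search.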

> For an AND search, this would require that the "too common" word is still
> examined for its presence in the target documents.  Any views from the
> developers as to the feasibility of this?

	I personally throw out stopwords altogether so there is no
	easy way to do that.  I don't see why words that are too common
	should be treated any differently from predefined stopwords
	since, for the document set, they *are* stopwords.
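
One way to picture treating "too common" words exactly like predefined stopwords is the Python sketch below. It is only an illustration under stated assumptions: the 50% document-frequency threshold, the function names, and the in-memory index layout are all hypothetical and are not taken from SWISH++.

```python
from collections import Counter


def too_common_words(docs, max_fraction=0.5):
    """Return words that occur in more than max_fraction of the documents.

    docs is a list of word lists; the 0.5 default is an assumed threshold.
    """
    doc_freq = Counter()
    for words in docs:
        doc_freq.update(set(words))  # count each word once per document
    limit = max_fraction * len(docs)
    return {w for w, n in doc_freq.items() if n > limit}


def build_index(docs, predefined_stopwords, max_fraction=0.5):
    """Build a word -> set of document ids index, discarding all stopwords."""
    # Too-common words are merged into the stopword set and thrown out
    # together with the predefined ones.
    stop = set(predefined_stopwords) | too_common_words(docs, max_fraction)
    index = {}
    for doc_id, words in enumerate(docs):
        for w in words:
            if w not in stop:
                index.setdefault(w, set()).add(doc_id)
    return index, stop
```

Because the discarded words never reach the index, there is nothing left to consult for them at search time, which is why checking a too-common word in an AND search is not easy under this design.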

> Finally, I don't like the fact that swish indexes comments in HTML documents.

	SWISH++ discards comments.

	- Paul J. Lucas
	  NASA Ames Research Center		Caelum Research Corporation
	  Moffett Field, California		San Jose, California
	  <pjl AT ptolemy DOT arc DOT nasa DOT gov>
Received on Thu May 7 00:39:08 1998