Re: [SWISH-E:283] Re: Swish comments

From: Paul J. Lucas <pjl(at)not-real.ptolemy.arc.nasa.gov>
Date: Fri May 08 1998 - 06:47:09 GMT

On Thu, 7 May 1998, Brendan Jones wrote:

> If people search on four terms, you're providing four separate context
> outputs for each document - disconnected or potentially overlapping
> slabs of text which swamps the reader - and may provide no more
> illumination than my preferred method of document title and first 100
> words.  All IMO, of course.

	Of course.

> My interface won't chop out content between <SCRIPT>...</SCRIPT>, but then
> again, no site I maintain has any javascript or applets.

	And let everybody else eat cake, I suppose.

> Chopping out content between <SCRIPT>...</SCRIPT> would be a relatively
> trivial perl modification if I ever needed to implement it.  Not messy at
> all.

	Not as trivial as you think:

		<SCRIPT>
		document.write( "</SCRIPT>" );
		document.write( "oops!" );
		</SCRIPT>

	Granted, it's a pathological case, but my background is in
	compilers and it's the pathological cases that are the fun
	ones.  (You have to balance quotation marks to get the above
	right.)
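
A minimal sketch of the quote-balancing idea, in Python rather than the Perl discussed above. The function names `strip_scripts_naive` and `strip_scripts_balanced` are made up for illustration and are not part of swish-e or SWISH++. The scanner assumes script strings use single or double quotes with backslash escapes, and it ignores JavaScript comments, so it is not a full tokenizer. It also takes no position on where any particular browser would end the element.

```python
import re

# Matches an opening <SCRIPT ...> tag, case-insensitively.
OPEN_TAG = re.compile(r"<script\b[^>]*>", re.IGNORECASE)


def strip_scripts_naive(html: str) -> str:
    """Cut from <SCRIPT> to the first </SCRIPT>, ignoring quotes."""
    return re.sub(r"(?is)<script\b.*?</script\s*>", "", html)


def strip_scripts_balanced(html: str) -> str:
    """Cut <SCRIPT> elements, skipping any </SCRIPT> inside a quoted string."""
    out = []
    i = 0
    n = len(html)
    while i < n:
        m = OPEN_TAG.search(html, i)
        if m is None:
            out.append(html[i:])  # no more scripts: keep the rest
            break
        out.append(html[i:m.start()])  # keep text before the script
        j = m.end()
        quote = None  # the quote character we are inside, if any
        while j < n:
            ch = html[j]
            if quote is not None:
                if ch == "\\":
                    j += 2  # skip the escaped character
                    continue
                if ch == quote:
                    quote = None
            elif ch in "\"'":
                quote = ch
            elif html[j:j + 8].lower() == "</script":
                end = html.find(">", j)
                j = n if end == -1 else end + 1
                break
            j += 1
        i = j  # resume after the closing tag (or at end of input)
    return "".join(out)


sample = (
    'before<SCRIPT>\n'
    'document.write( "</SCRIPT>" );\n'
    'document.write( "oops!" );\n'
    '</SCRIPT>after'
)
print(repr(strip_scripts_naive(sample)))
print(repr(strip_scripts_balanced(sample)))
```

On the sample, the naive version stops at the `</SCRIPT>` inside the string and leaves the second `document.write` line in the output, while the balanced version removes the whole element.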

> > 	I personally throw out stopwords altogether ...
> 
> In many cases doing this is fine, but in some situations, the stopword
> adds nonzero value.  For example, say someone does an AND search for
> "fee fie foe foo" and "foe" is a stopword that has been thrown out because
> it is too common.
> 
> In my view, the correct way to implement this search is to first find all
> documents which contain fee, fie and foo.  So far, this is what you've
> implemented.  But that output could be further refined (and correctly
> refined) by then throwing out any documents in this list which do NOT also
> contain foe, even though foe is a stopword.

	You're missing the point: if a word is so common as to be
	considered a stopword, then most of the results will be thrown
	out.
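
For reference, the two-step search proposed in the quoted text can be sketched as follows. This is a toy illustration in Python, not code from swish-e or SWISH++: the corpus, the document ids, and the helper names (`build_index`, `and_search`, `refine_with_stopwords`) are all hypothetical. Because stopwords are left out of the index, the second pass has to rescan the text of each hit. How many hits it removes depends entirely on how common the stopword is in the collection.

```python
def build_index(docs, stopwords):
    """Map each non-stopword to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            if word not in stopwords:
                index.setdefault(word, set()).add(doc_id)
    return index


def and_search(index, terms, stopwords):
    """Step 1: intersect the posting sets of the non-stopword terms."""
    kept = [t for t in terms if t not in stopwords]
    if not kept:
        return set()
    result = set(index.get(kept[0], set()))
    for term in kept[1:]:
        result &= index.get(term, set())
    return result


def refine_with_stopwords(docs, hits, terms, stopwords):
    """Step 2: drop hits whose text lacks any stopword in the query."""
    wanted = [t for t in terms if t in stopwords]
    return {
        doc_id for doc_id in hits
        if all(w in docs[doc_id].split() for w in wanted)
    }


# Hypothetical four-document corpus; "foe" is treated as a stopword.
docs = {
    1: "fee fie foe foo",
    2: "fee fie foo",
    3: "fee foe",
    4: "fee fie foe foo fum",
}
stopwords = {"foe"}
query = ["fee", "fie", "foe", "foo"]

index = build_index(docs, stopwords)
hits = and_search(index, query, stopwords)
refined = refine_with_stopwords(docs, hits, query, stopwords)
print(sorted(hits), sorted(refined))
```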

> > 	SWISH++ discards comments.
> 
> Well, you've sold me on swish++ rather than swish-e on this alone!

	Incidentally, do you know if:

		<!-- "-->" -->

	is a valid comment in HTML?  (I had to look this one up
	myself.)  My interpretation of the HTML 4.0 spec. is that it is
	NOT a valid comment since quotes are not to be treated
	specially, i.e., balanced.
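
Under that interpretation, a comment stripper would end the comment at the first `-->` whether or not it sits inside quotes. A minimal Python sketch follows; the function name `strip_comments` is made up for illustration, and the `<!--` ... `-->` matching is a simplification of the full SGML comment rules, not a claim about how SWISH++ does it.

```python
def strip_comments(html: str) -> str:
    """Remove <!-- ... --> spans, ending each at the first "-->".

    Quotes get no special treatment, so a "-->" inside a quoted
    string still terminates the comment.
    """
    out = []
    i = 0
    while True:
        start = html.find("<!--", i)
        if start == -1:
            out.append(html[i:])  # no more comments: keep the rest
            break
        out.append(html[i:start])  # keep text before the comment
        end = html.find("-->", start + 4)
        if end == -1:
            break  # unterminated comment: drop the remainder
        i = end + 3
    return "".join(out)


print(repr(strip_comments('<!-- "-->" -->')))
```

For the example above, the comment ends after the first `-->`, so the trailing `" -->` is left behind as ordinary text.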

	- Paul J. Lucas
	  NASA Ames Research Center		Caelum Research Corporation
	  Moffett Field, California		San Jose, California
	  <pjl AT ptolemy DOT arc DOT nasa DOT gov>
Received on Thu May 7 23:56:09 1998