From: Don Fike <fike(at)>
Date: Thu Aug 01 2002 - 20:29:05 GMT
It appears that the difference in the word count is coming from HTML
returning words within comment tags <!--  -->  and HTML2 does not return


-----Original Message-----
From: Bill Moseley
Sent: Thursday, August 01, 2002 3:09 PM
To: Multiple recipients of list
Subject: Re: [SWISH-E] HTML vs. HTML2

At 11:23 AM 08/01/02 -0700, Don Fike wrote:
>Doing indexing with HTML2 I get fewer words indexed than with HTML.
>Isn't HTML2 the recommended parser?  Is there a known reason for the

Yes, they are different parsers.

I've recommended this before, but build both indexes, then do something like

  ./swish-e -f index1 -T INDEX_WORDS_ONLY | sort > index1.words
  ./swish-e -f index2 -T INDEX_WORDS_ONLY | sort > index2.words
  diff index1.words index2.words
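
If the diff output is hard to scan, the two sorted lists can also be split into "only in the first index" and "only in the second index" with comm. This is only a sketch: it assumes the .words files hold one word per line and were sorted under the same locale, and the helper names only_in_first and only_in_second are made up for illustration.

```shell
# Print entries that appear only in the first of two sorted word lists.
# Both files must be sorted with the same collation (e.g. the same LC_ALL).
only_in_first() {
    comm -23 "$1" "$2"
}

# Print entries that appear only in the second of two sorted word lists.
only_in_second() {
    comm -13 "$1" "$2"
}

# Example usage with the files produced above (output names are hypothetical):
#   only_in_first  index1.words index2.words > index1.only
#   only_in_second index1.words index2.words > index2.only
```

Each resulting file then lists the words that one parser indexed and the other did not.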

Then use swish to look up the files of the words that don't match.  You will
likely find why the HTML2 parser is better (or rather where HTML parser is

Bill Moseley
Received on Thu Aug 1 20:35:57 2002