At 11:23 AM 08/01/02 -0700, Don Fike wrote:
>Doing indexing with HTML2 I get fewer words indexed than with HTML.
>Isn't HTML2 the recommended parser? Is there a known reason for the
Yes, they are different parsers.
I've recommended this before, but build both indexes, then do something like:
./swish-e -f index1 -T INDEX_WORDS_ONLY | sort > index1.words
./swish-e -f index2 -T INDEX_WORDS_ONLY | sort > index2.words
diff index1.words index2.words
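
If the diff output gets noisy, one possible variation is to split the differences into "only in index1" and "only in index2" with comm, which expects sorted input like the files produced above. This is only a sketch: the helper name only_in_first and the stand-in word lists are made up for illustration, and in practice you would point it at the real index1.words and index2.words.

```shell
# Hypothetical helper: print the words that appear in the first sorted
# list but not in the second. comm -23 suppresses the lines unique to
# the second file and the lines common to both.
only_in_first() {
    comm -23 "$1" "$2"
}

# Stand-in data in a scratch directory (an assumption for this sketch);
# real word lists would come from the swish-e commands above.
demo=$(mktemp -d)
printf '%s\n' apple banana cherry | sort > "$demo/index1.words"
printf '%s\n' banana cherry date  | sort > "$demo/index2.words"

# Words only the first index has, then words only the second index has.
only_in_first "$demo/index1.words" "$demo/index2.words"
only_in_first "$demo/index2.words" "$demo/index1.words"
```

Keeping the two directions in separate lists makes it easier to see which parser picked up words the other one missed.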
Then use swish to look up the files containing the words that don't match. You will
likely find why the HTML2 parser is better (or rather where HTML parser is
Received on Thu Aug 1 20:32:26 2002