From: Bill Moseley <moseley(at)>
Date: Thu Aug 01 2002 - 22:49:27 GMT
At 01:29 PM 08/01/02 -0700, Don Fike wrote:
>It appears that the difference in the word count is coming from HTML
>returning words within comment tags <!--  -->  and HTML2 does not return

Ok, yes, that seems to be one of the places the HTML parser is broken.  If
the comment contains HTML tags, it seems like the first ">" is enough to
confuse the parser into thinking the comment has ended.

> cat 1.html
<!-- comment <b>with</b> html -->

> ./swish-e -v0 -i 1.html -T indexed_words
    Adding:[1:swishdefault(1)]   'with'   Pos:1  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'html'   Pos:2  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'bodyword'   Pos:3  Stuct:0x9 ( BODY FILE )
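
The failure above can be reproduced outside swish-e with a small sketch.
This is not swish-e's code; the function names are hypothetical, and the
sample assumes the test file also held the word "bodyword" after the
comment, since that word shows up in the indexed output.  The naive
version treats everything from "<" to the next ">" as a tag, so the
comment "ends" at the ">" of <b>.  The second version removes whole
comments first.

```python
import re

WORD = re.compile(r"[A-Za-z0-9]+")

def words_naive(html):
    # Treats any "<" ... ">" run as a tag, so "<!-- comment <b>" is
    # consumed as one tag and the rest of the comment leaks out as text.
    text = re.sub(r"<[^>]*>", " ", html)
    return WORD.findall(text)

def words_comment_aware(html):
    # Drop complete "<!--" ... "-->" comments before stripping tags.
    text = re.sub(r"<!--.*?-->", " ", html, flags=re.DOTALL)
    text = re.sub(r"<[^>]*>", " ", text)
    return WORD.findall(text)

# Assumed file content: the comment from 1.html plus a body word.
sample = "<!-- comment <b>with</b> html -->\nbodyword"

print(words_naive(sample))          # ['with', 'html', 'bodyword']
print(words_comment_aware(sample))  # ['bodyword']
```

The naive result matches the three words the HTML parser indexed above.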

The HTML "parser" is a mess.  I got tired of trying to patch it so that's
why HTML2 was created.

Bill Moseley
Received on Thu Aug 1 22:52:57 2002