RE: HTML vs. HTML2

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Aug 01 2002 - 22:49:27 GMT
At 01:29 PM 08/01/02 -0700, Don Fike wrote:
>
>It appears that the difference in the word count is coming from HTML
>returning words within comment tags <!--  -->  and HTML2 does not return
>these.

Ok, yes, that seems to be one of the places the HTML parser is broken.  If
the comment contains HTML, it seems the ">" is enough to confuse the
parser into thinking the comment has ended.

> cat 1.html
<html>
<body>
<!-- comment <b>with</b> html -->
Bodyword
</body>
</html>

> ./swish-e -v0 -i 1.html -T indexed_words
    Adding:[1:swishdefault(1)]   'with'   Pos:1  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'html'   Pos:2  Stuct:0x9 ( BODY FILE )
    Adding:[1:swishdefault(1)]   'bodyword'   Pos:3  Stuct:0x9 ( BODY FILE )
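
The same symptom can be reproduced outside swish-e. What follows is only a
rough Python sketch of the suspected failure mode, not the actual swish-e
code (which is C). The function names are made up for illustration, and the
"comment ends at the first >" rule is an assumption about what the old
parser effectively does.

```python
import re

def strip_comments_naive(html):
    # Assumed buggy rule: a comment is taken to end at the first ">",
    # so "<!-- comment <b>" is removed and the rest leaks through.
    return re.sub(r"<!--[^>]*>", " ", html)

def strip_comments(html):
    # Intended rule: a comment runs up to the first "-->".
    return re.sub(r"<!--.*?-->", " ", html, flags=re.DOTALL)

def strip_tags(html):
    # Drop anything that looks like a tag.
    return re.sub(r"<[^>]*>", " ", html)

def words(text):
    # Lowercased alphanumeric runs, roughly what an indexer would keep.
    return re.findall(r"[a-z0-9]+", text.lower())

doc = (
    "<html>\n<body>\n"
    "<!-- comment <b>with</b> html -->\n"
    "Bodyword\n"
    "</body>\n</html>\n"
)

naive_words = words(strip_tags(strip_comments_naive(doc)))
fixed_words = words(strip_tags(strip_comments(doc)))

print(naive_words)
print(fixed_words)
```

With the naive rule the words inside the comment are indexed along with
the body text; with the "-->" rule only the body word is left.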

The HTML "parser" is a mess.  I got tired of trying to patch it, which is
why HTML2 was created.




-- 
Bill Moseley
mailto:moseley@hank.org
Received on Thu Aug 1 22:52:57 2002