At 01:29 PM 08/01/02 -0700, Don Fike wrote:
>
>It appears that the difference in the word count is coming from HTML
>returning words within comment tags <!-- --> while HTML2 does not return
>these.
Ok, yes, that seems to be one of the places where the HTML parser is broken.
If the comment contains HTML tags, it seems the ">" of an inner tag is enough
to confuse the parser into thinking the comment has ended.
> cat 1.html
<html>
<body>
<!-- comment <b>with</b> html -->
Bodyword
</body>
</html>
> ./swish-e -v0 -i 1.html -T indexed_words
Adding:[1:swishdefault(1)] 'with' Pos:1 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'html' Pos:2 Stuct:0x9 ( BODY FILE )
Adding:[1:swishdefault(1)] 'bodyword' Pos:3 Stuct:0x9 ( BODY FILE )
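
The output above is consistent with a scanner that treats the first ">" as
the end of whatever "<" opened, comment or not. Here is a rough Python sketch
of that failure mode next to a comment-aware version. This is not the swish-e
C code; the function names (strip_tags_naive, words_naive,
words_comment_aware) are made up for illustration, and the word splitting is
a simplification of what the indexer really does.

```python
import re

DOC = """<html>
<body>
<!-- comment <b>with</b> html -->
Bodyword
</body>
</html>
"""

def strip_tags_naive(html):
    # Assumed buggy behaviour: any "<" starts markup and the very next ">"
    # ends it, so a ">" inside a comment closes the comment early.
    out = []
    in_markup = False
    for ch in html:
        if ch == "<":
            in_markup = True
            out.append(" ")
        elif ch == ">":
            in_markup = False
            out.append(" ")
        elif not in_markup:
            out.append(ch)
    return "".join(out)

def to_words(text):
    # Simplified stand-in for the indexer's word splitting.
    return re.findall(r"[a-z0-9]+", text.lower())

def words_naive(html):
    return to_words(strip_tags_naive(html))

def words_comment_aware(html):
    # Drop whole comments first; a comment ends only at "-->".
    without_comments = re.sub(r"<!--.*?-->", " ", html, flags=re.DOTALL)
    return to_words(strip_tags_naive(without_comments))

print(words_naive(DOC))          # ['with', 'html', 'bodyword']
print(words_comment_aware(DOC))  # ['bodyword']
```

The naive version leaks "with" and "html" out of the comment, the same words
the HTML parser indexed above, while the comment-aware version keeps only
"bodyword".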
The HTML "parser" is a mess. I got tired of trying to patch it, which is
why HTML2 was created.
--
Bill Moseley
mailto:moseley@hank.org
Received on Thu Aug 1 22:52:57 2002