incrementing word position on HTML block tags

From: Peter Karman <karman(at)not-real.cray.com>
Date: Mon Feb 09 2004 - 15:46:08 GMT
I found several threads on word position in the archives, but none
specifically on HTML block tags. This is a follow-up to my questions
last week on the difference between using the HTML2 and XML2 parsers.

It appears that the HTML2 parser currently does not increment the word
position when it reaches an HTML block tag. libxml2 defines block tags
as:

pre
p
div
dl
center
blockquote

etc. Also, all the h\d heading tags are included in that definition.

My question is: should phrase matching really work across something like
a <p> tag? Or across an <h\d> tag?

For example:
===============
<body>
<h1>some title</h1>
<p>some text</p>
</body>
===============

A phrase search for "title some" will match, even though the two words
sit in different elements.

I realize that HTML is mostly tagged for what it should /look/ like and
not what it means, but this seems counterintuitive to me. I realize that
there are various config options to control some of the bumping features
(BumpPositionCounterCharacters, etc.), but these seem to ignore HTML tags
(which I assume, from staring at parser.c, are parsed before the bump
characters are evaluated).

Looking at the parser.c code, it seems possible to implement something
like a BumpPositionCounteronHTMLBlocks (NO|yes) config option, but
before I jump in and try to hack that bit, I wanted to throw it out
there and see if there is some piece of logic that I'm missing.
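
To make the idea concrete, here is a rough Python sketch of the
behaviour in question. It is not swish-e code and does not reflect how
parser.c is written: the class name, the bump_on_block flag, and the
BLOCK_TAGS set (copied from the partial list above) are all made up for
illustration. Words get consecutive positions, and when the flag is on,
every block tag (open or close) bumps the counter by one.

```python
from html.parser import HTMLParser

# Assumed subset of block tags, taken from the list in this message.
BLOCK_TAGS = {"pre", "p", "div", "dl", "center", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class PositionIndexer(HTMLParser):
    """Hypothetical indexer: gives each word in the text a position."""

    def __init__(self, bump_on_block=False):
        super().__init__()
        self.bump_on_block = bump_on_block
        self.position = 0
        self.words = []  # list of (word, position) pairs

    def _bump(self, tag):
        # Leave a one-position gap at every block tag boundary.
        if self.bump_on_block and tag in BLOCK_TAGS:
            self.position += 1

    def handle_starttag(self, tag, attrs):
        self._bump(tag)

    def handle_endtag(self, tag):
        self._bump(tag)

    def handle_data(self, data):
        for word in data.split():
            self.position += 1
            self.words.append((word.lower(), self.position))


def index_html(html, bump_on_block=False):
    indexer = PositionIndexer(bump_on_block)
    indexer.feed(html)
    indexer.close()
    return indexer.words


def phrase_matches(words, phrase):
    """True if the phrase's words occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms:
        return False
    positions = {}
    for word, pos in words:
        positions.setdefault(word, set()).add(pos)
    return any(
        all(start + i in positions.get(term, ())
            for i, term in enumerate(terms))
        for start in positions.get(terms[0], ())
    )


DOC = "<body>\n<h1>some title</h1>\n<p>some text</p>\n</body>"

print(phrase_matches(index_html(DOC), "title some"))
print(phrase_matches(index_html(DOC, bump_on_block=True), "title some"))
print(phrase_matches(index_html(DOC, bump_on_block=True), "some title"))
```

Without the bump, "title" and "some" land on adjacent positions, so the
phrase matches across the </h1> and <p> boundary. With the bump on, the
gap breaks that match, while "some title" inside the <h1> still matches.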

Anyone?

thanks.

pek

-- 
Peter Karman - Software Publications Engineer - Cray Inc
phone: 651-605-9009 - mailto:karman@cray.com
Received on Mon Feb 9 07:46:09 2004