On Mon, Feb 09, 2004 at 07:45:14AM -0800, Peter Karman wrote:
> It appears that currently, using the HTML2 parser does not increment the
> word position each time it reaches a HTML block tag. libxml2 defines
> block tags as:
I thought it did, but I see that all I'm doing is adding separation
between words if the tag is not "inline".
For fun you might modify parser.c from:
    else if ( !element->isinline )
        append_buffer( &parse_data->text_buffer, " ", 1 );  // could flush buffers
to:
    else if ( !element->isinline )
    {
        printf("Not inline tag <%s>\n", tag );
        append_buffer( &parse_data->text_buffer, " ", 1 );
    }
or something like that and see if that shows true for all your block
elements (it's supposed to, I think).
IIRC, the parser builds a buffer of the text. The buffer is flushed
(i.e. sent to swish for indexing) when it gets full or on metaname
changes. A starting word position is passed along with the buffer,
which is returned incremented. So the steps required to bump the word
position (for say a block element tag) are to flush the buffer and then
bump the word position (parse_data->word_pos).
So, it may be as easy as replacing that append_buffer() call above with:
    {
        flush_buffer( parse_data, 1 );
        parse_data->word_pos++;
    }
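To make the flush-then-bump idea concrete, here is a rough standalone
sketch. It is a toy model only, not the real parser.c: the struct and the
toy_* functions are made up for illustration (only the field names
text_buffer and word_pos come from the code above), and starting the
position at 1 is an assumption.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the parser state. Only the names
 * text_buffer and word_pos are borrowed from parser.c. */
struct toy_parse_data {
    char   text_buffer[256];
    size_t len;
    int    word_pos;
};

/* Toy append: copy text onto the end of the pending buffer. */
static void toy_append(struct toy_parse_data *pd, const char *s)
{
    size_t n = strlen(s);
    assert(pd->len + n < sizeof pd->text_buffer);
    memcpy(pd->text_buffer + pd->len, s, n);
    pd->len += n;
    pd->text_buffer[pd->len] = '\0';
}

/* Toy flush: "index" each buffered word at the current position,
 * advancing word_pos by one per word, then empty the buffer. */
static void toy_flush(struct toy_parse_data *pd)
{
    size_t i = 0;
    while (i < pd->len) {
        while (i < pd->len && isspace((unsigned char)pd->text_buffer[i]))
            i++;
        size_t start = i;
        while (i < pd->len && !isspace((unsigned char)pd->text_buffer[i]))
            i++;
        if (i > start) {
            printf("%d: %.*s\n", pd->word_pos,
                   (int)(i - start), pd->text_buffer + start);
            pd->word_pos++;
        }
    }
    pd->len = 0;
    pd->text_buffer[0] = '\0';
}

/* What a block-level tag would do under the proposed change:
 * flush what is pending, then leave a one-position gap. */
static void toy_block_tag(struct toy_parse_data *pd)
{
    toy_flush(pd);
    pd->word_pos++;
}
```

Starting at word_pos 1, appending "hello world", calling toy_block_tag(),
then appending "next para" and flushing puts "world" at position 2 and
"next" at position 4. The two words are no longer adjacent, so a phrase
search for "world next" would not match across the block tag.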
Are there cases where a phrase should match across a block level
element?
> My question is: should phrase matching really work across something like
> a <p> tag? Or across a <h\d> tag?
Ya, that's the question. Probably not.
> In looking at the parser.c code, I see that it seems to be possible to
> implement something like a BumpPositionCounteronHTMLBlocks (NO|yes)
> config option or something like that, but before I jumped in and tried
> to hack that bit, I wanted to throw it out there and see if there some
> piece of logic that I'm missing.
If nobody can come up with a reason that it shouldn't bump, then we don't
need a config option.... ;)
--
Bill Moseley
moseley@hank.org
Received on Mon Feb 9 08:33:25 2004