Re: incrementing word position on HTML block tags

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Mon Feb 09 2004 - 16:33:25 GMT
On Mon, Feb 09, 2004 at 07:45:14AM -0800, Peter Karman wrote:
> It appears that currently, using the HTML2 parser does not increment the 
> word position each time it reaches a HTML block tag. libxml2 defines 
> block tags as:

I thought it did, but I see that all I'm doing is adding separation 
between words if the tag is not "inline".

For fun you might modify parser.c from:

        else if ( !element->isinline )
            append_buffer( &parse_data->text_buffer, " ", 1 );  // could flush buffers

to:

        else if ( !element->isinline )
        {
            printf("Not inline tag <%s>\n", tag );
            append_buffer( &parse_data->text_buffer, " ", 1 );
        }

or something like that and see if that shows true for all your block 
elements (it's supposed to, I think).

IIRC, the parser builds a buffer of the text.  The buffer is flushed 
(i.e. sent to swish for indexing) when it gets full or on metaname 
changes.  A starting word position is passed along with the buffer, 
and that position is returned incremented.  So the steps required to 
bump the word position (for, say, a block element tag) are to flush the 
buffer and then bump the word position (parse_data->word_pos).

So, it may be as easy as replacing that append_buffer() call above with:

   {
       flush_buffer( parse_data, 1 );
       parse_data->word_pos++;
   }
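
To make the effect concrete, here is a small standalone sketch of the 
flush-then-bump idea.  It is not the swish-e code: ParseData, 
index_text(), flush_text(), append_text(), on_block_tag() and 
position_after_block_tag() are all made-up names for illustration, and 
the "indexer" just counts whitespace-separated words.  It only shows how 
an extra increment after a flush would leave a gap in word positions.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the parser state; not the real swish-e struct. */
typedef struct {
    char   text[256];  /* pending text that has not been indexed yet */
    size_t len;        /* number of bytes currently held in text */
    int    word_pos;   /* position of the last word indexed so far */
} ParseData;

/* Toy indexer: each whitespace-separated word gets the next position.
 * Returns the position of the last word seen (start_pos if none). */
static int index_text(const char *text, size_t len, int start_pos)
{
    int pos = start_pos;
    int in_word = 0;

    for (size_t i = 0; i < len; i++) {
        if (isspace((unsigned char)text[i])) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            pos++;
        }
    }
    return pos;
}

/* Send the pending text to the toy indexer and empty the buffer. */
static void flush_text(ParseData *pd)
{
    assert(pd != NULL);
    pd->word_pos = index_text(pd->text, pd->len, pd->word_pos);
    pd->len = 0;
}

/* Append text to the pending buffer (the sketch assumes it always fits). */
static void append_text(ParseData *pd, const char *s)
{
    size_t n = strlen(s);

    assert(pd != NULL);
    assert(pd->len + n <= sizeof pd->text);
    memcpy(pd->text + pd->len, s, n);
    pd->len += n;
}

/* At a non-inline tag: either only separate the words (bump == 0),
 * or flush and skip one word position (bump != 0). */
static void on_block_tag(ParseData *pd, int bump)
{
    if (bump) {
        flush_text(pd);
        pd->word_pos++;
    } else {
        append_text(pd, " ");
    }
}

/* Index "one two", a block tag, then "three"; return the position
 * that "three" ends up with.  "two" is always at position 2. */
int position_after_block_tag(int bump)
{
    ParseData pd = { .len = 0, .word_pos = 0 };

    append_text(&pd, "one two");
    on_block_tag(&pd, bump);
    append_text(&pd, "three");
    flush_text(&pd);
    return pd.word_pos;
}
```

With bump set to 0, "three" lands at position 3, right after "two", so 
the phrase "two three" would still match across the tag.  With bump set 
to 1 it lands at position 4, so the two words are no longer adjacent.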

Are there cases where a phrase should match across a block level 
element?

> My question is: should phrase matching really work across something like 
> a <p> tag? Or across a <h\d> tag?

Ya, that's the question.  Probably not.

> In looking at the parser.c code, I see that it seems to be possible to 
> implement something like a BumpPositionCounteronHTMLBlocks (NO|yes) 
> config option or something like that, but before I jumped in and tried 
> to hack that bit, I wanted to throw it out there and see if there is some 
> piece of logic that I'm missing.

If nobody can come up with a reason that it shouldn't bump the position, 
then we don't need a config option.... ;)

-- 
Bill Moseley
moseley@hank.org
Received on Mon Feb 9 08:33:25 2004