Re: Different number of indexed words when indexing large amount of data

From: Bill Moseley <moseley(at)>
Date: Wed Apr 26 2006 - 13:44:05 GMT
On Wed, Apr 26, 2006 at 06:35:56AM -0700, Rodolfo Martinez wrote:
> I found who was causing this behavior, it was the libxml2 library.
> I replaced TXT2 and HTML2 by TXT and HTML, respectively, in the configuration
> file. Now I'm getting the same number of indexed words _always_.
> SWISH-E's internal parsers require _much_ more memory than the libxml2 parser, but they
> always work as expected.

Well, it may produce the same word count, but I would not say that it
works as expected.  Swish-e's built-in HTML parser can easily get
confused and produce incorrect results.

If libxml2 is producing different results each time it's run then
that's something to take up with the libxml2 list.  But, as that's
such a widely used library I'd want to be sure about the problem
before posting on that list.

libxml2 has utilities for testing it on files.  It might be worth
seeing if you can reproduce what you are seeing outside of swish.
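
As a rough sketch of that kind of check, the script below runs a command
several times on the same input and compares hashes of the output.  It
assumes the xmllint tool that ships with libxml2 is on your PATH and that
its --html option is the relevant parsing mode; the function names and
the file name sample.html are made up for illustration, and this is not
how swish-e itself calls the library.

```python
import hashlib
import subprocess
import sys


def output_digests(cmd, runs=5):
    """Run cmd `runs` times; return the set of SHA-256 digests of its output.

    stderr is merged into stdout so parser warnings are compared too.
    """
    digests = set()
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
        )
        digests.add(hashlib.sha256(result.stdout).hexdigest())
    return digests


def is_deterministic(cmd, runs=5):
    """True if every run of cmd produced byte-identical output."""
    return len(output_digests(cmd, runs)) == 1


def check_html_file(path, runs=5):
    """Assumption: `xmllint --html PATH` exercises libxml2's HTML parser."""
    return is_deterministic(["xmllint", "--html", path], runs)
```

Calling check_html_file("sample.html", runs=20) on one of the documents
from the problem index should return True if the parser behaves the same
way every time.  If it returns False, that file is a good candidate for
the small test case.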

Can you generate a *small* test case that shows the problem?

Bill Moseley

Received on Wed Apr 26 06:44:05 2006