Re: Different number of indexed words when indexing large amount of data

From: Rodolfo Martinez <macr111080(at)not-real.yahoo.com.mx>
Date: Wed Apr 26 2006 - 13:39:31 GMT
Hi,

I found what was causing this behavior: it was the libxml2 library.

I replaced TXT2 and HTML2 with TXT and HTML, respectively, in the
configuration file. Now I _always_ get the same number of indexed words.
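
For anyone hitting the same problem, here is a rough sketch of what that kind of change looks like. It assumes the parsers are selected with IndexContents lines; the file extensions are only placeholders and are not taken from my actual configuration.

```
# Before: TXT2 and HTML2 select the libxml2-based parsers
# IndexContents TXT2  .txt
# IndexContents HTML2 .htm .html

# After: TXT and HTML select SWISH-E's internal parsers
IndexContents TXT  .txt
IndexContents HTML .htm .html
```

If the parser is chosen with DefaultContents instead, the same substitution would apply there.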

SWISH-E's internal parsers require _much_ more memory than the libxml2
parser, but they always work as expected.

Thanks again for your support,
Rodolfo

--- Bill Moseley <moseley@hank.org> wrote:

> On Mon, Apr 24, 2006 at 08:56:51AM -0700, Rodolfo Martinez wrote:
> > Hi Bill,
> > 
> > Thanks for your response. I tried indexing just those files and got the
> > same keywords. I got this behavior only when indexing all information. I
> > have hundreds (thousands?) of files in the same situation.
> > 
> > I extracted the keywords and saw how they differ but I didn't get any clue.
> 
> Then maybe it's the count that is suspect?  I'm not sure what to tell
> you.
> 
> > I have another question: does the previously indexed file affect the
> > current indexing process in some way?
> 
> Nope.
> 
> -- 
> Bill Moseley
> moseley@hank.org
> 


Received on Wed Apr 26 06:39:37 2006