Skip to main content.
home | support | download

Back to List Archive

Re: New versions of swish-e 2.x

From: <jmruiz(at)not-real.boe.es>
Date: Wed Nov 15 2000 - 10:01:55 GMT
Hi Bill,


On 14 Nov 2000, at 7:31, Bill Moseley wrote:

> Hi Jose,
> 
> Again, I'm not a C programmer, and I don't understand much.  But I
> have a few questions that aren't really related to swish but to the
> code
> 
> I still have questions about stemmer.c.  You now allocate and free
> memory if the stemmed words ends up longer.  This causes a few
> problems for me in that my Stemmer perl module now will need mem.c. 
> Not a huge problem.  But I'm still not clear on why reallocating the
> memory is even needed.  Seems like we have a MAXWORDLEN constant that
> can be used.  Back to this question below.
> 

Well, it is just about the way I like to code. I do not like constants 
like MAXWORDLEN because the modules are not reusable in other
projects unless you define that constants inside the files. So, if you
want to use the stemmer.c function in another project, you have to 
remember that there is a constant with some length. If the constant
does not exist, no worry about it.
Using constants instead of variables for passing info to the 
subroutines is not a clean way to comunicate modules. 

But, of course, this is just my opinion. 

> Let me first back up and ask some questions about index.c and search.c
> where Stem() is called:
> 
> In index.c you do this:
> static int lenword=0;
> static char *word=NULL;
> char *wordconv=NULL;
> int bump_position_flag=0;
> struct indexfile *indexf=sw->indexlist;
> 
>  i=0;
>  if(!lenword) word = (char *)emalloc((lenword=MAXWORDLEN) +1);
> 
> What's the reason for the if ( !lenword )?  Isn't !lenword always true
> there?

lenword is 0 the first time the function is executed. It is a static var, 
so, its value is preserved between calls. So, memory only need to 
be allocated once: the first time the function is executed (lenword is 
0). This should make the code faster but, in the other hand, it is not 
thread safe (this is the index proccess. search proccess is thread 
safe).

> 
> Now later you do this:
>     if (i == lenword) {	
>      lenword *= 2;
>      word =erealloc(word,lenword +1);
>     }
> 
> What scares me is that we have this constant MAXWORDLEN == 1000.  But
> you are now saying if the word is bigger go ahead and double the
> amount of memory allocated and continue.  So MAXWORDLEN is not really
> the max length of the word?
> 

Right, for me MAXWORDLEN is not really the max length of the 
word, it is the size of the initial buffer to store it. If it is big enough 
(eg 1000), you will not need to reallocate the size of buffer (faster). 
Anyway, you do not have to worry about the max length of the word, 
if it is 1001, one extra byte will be allocated. I have tried to make the 
code more secure to avoid buffer overruns. Imagine that someone 
calls your CGI script with a word of 1001 characters wich is greater 
than MAXWORDLEN. Well, we can do two things to handle it:
- Truncate the word up to MAXWORDLEN like swish-e 1.3.3 (1.3.2 + 
Patches)
- Reallocate the word like swish-e 2.x
I rather prefer the second one.
If we do nothing, a core with the subsequent actions will occur.

> Now move over to search.c:
> 
> Here you take a search word, remove the "*", Stem it, then if needed,
> allocate some more space.  Maybe I'm really confused, but it looks to
> me like you never assign the new pointer value to "word".
> 
>  if (indexf->header.applyStemmingRules)
>  {
>   /* apply stemming algorithm to the search term */
>   i=strlen(word)-1;
>   if(i && word[i]=='*') word[i]='\0';
>   else i=0; /* No star */
>   Stem(&word,&lenword);
>   if(i) 
>   {
>    tmp=emalloc(strlen(word)+2);
>    strcpy(tmp,word);
>    strcat(tmp,"*"); 
>    efree(word);
>    tmp=word;
> Here you created a new memory block in tmp, and then copied the
> contents of "word" into "tmp". Then free() is called on "word", but
> then word is copied to tmp?  Now tmp points to free()d memory, no?
> 
> Am I confused?  You never set word to the new pointer value? 
> Shouldn't it be
>        word = tmp;
> 

You are right. This is a  bug!!!. It should be word=tmp;

> 
> Now, back to stemmer.c.  I'm not sure I understand the need to do all
> the memory allocation in stemmer.c, and I worry about having a
> constant called MAXWORDLEN, but allow words (and stemmed words) to be
> larger.  Plus, it would seem faster not to have to allocate and free
> memory all the time.
> 
> Do you see a reason to say that words can be longer than MAXWORDLEN? 
> If not then why not just allocate variables on the stack that are
> MAXWORDLEN when needed?  Stack variables are thread safe, right?  It's
> a lot faster than asking the system for more memory all the time.
> 
> In stemmer.c one could just check that the stemmed word was not longer
> than MAXWORDLEN and return the original word if it was (and return
> false as the original Stem() returned to indicate failure).
> 
> Maybe I don't understand what it means to be thread-safe, but I can't
> imagine that stack variables are not thread safe.
> 

Well, I do not use stemmer.c, and you know it much better. So you 
are probably right; in this module it has nonsense doing these 
things with word's buffers. I can go back to original stemmer.c 
schema (using MAXWORDLEN) to avoid calling the allocation 
routines.

cu
Jose
Received on Wed Nov 15 10:03:27 2000