Hi Bill,
On 14 Nov 2000, at 7:31, Bill Moseley wrote:
> Hi Jose,
>
> Again, I'm not a C programmer, and I don't understand much. But I
> have a few questions that aren't really related to swish but to the
> code
>
> I still have questions about stemmer.c. You now allocate and free
> memory if the stemmed words ends up longer. This causes a few
> problems for me in that my Stemmer perl module now will need mem.c.
> Not a huge problem. But I'm still not clear on why reallocating the
> memory is even needed. Seems like we have a MAXWORDLEN constant that
> can be used. Back to this question below.
>
Well, it is just about the way I like to code. I do not like constants
like MAXWORDLEN because the modules are not reusable in other
projects unless you define that constants inside the files. So, if you
want to use the stemmer.c function in another project, you have to
remember that there is a constant with some length. If the constant
does not exist, no worry about it.
Using constants instead of variables for passing info to the
subroutines is not a clean way to comunicate modules.
But, of course, this is just my opinion.
> Let me first back up and ask some questions about index.c and search.c
> where Stem() is called:
>
> In index.c you do this:
> static int lenword=0;
> static char *word=NULL;
> char *wordconv=NULL;
> int bump_position_flag=0;
> struct indexfile *indexf=sw->indexlist;
>
> i=0;
> if(!lenword) word = (char *)emalloc((lenword=MAXWORDLEN) +1);
>
> What's the reason for the if ( !lenword )? Isn't !lenword always true
> there?
lenword is 0 the first time the function is executed. It is a static var,
so, its value is preserved between calls. So, memory only need to
be allocated once: the first time the function is executed (lenword is
0). This should make the code faster but, in the other hand, it is not
thread safe (this is the index proccess. search proccess is thread
safe).
>
> Now later you do this:
> if (i == lenword) {
> lenword *= 2;
> word =erealloc(word,lenword +1);
> }
>
> What scares me is that we have this constant MAXWORDLEN == 1000. But
> you are now saying if the word is bigger go ahead and double the
> amount of memory allocated and continue. So MAXWORDLEN is not really
> the max length of the word?
>
Right, for me MAXWORDLEN is not really the max length of the
word, it is the size of the initial buffer to store it. If it is big enough
(eg 1000), you will not need to reallocate the size of buffer (faster).
Anyway, you do not have to worry about the max length of the word,
if it is 1001, one extra byte will be allocated. I have tried to make the
code more secure to avoid buffer overruns. Imagine that someone
calls your CGI script with a word of 1001 characters wich is greater
than MAXWORDLEN. Well, we can do two things to handle it:
- Truncate the word up to MAXWORDLEN like swish-e 1.3.3 (1.3.2 +
Patches)
- Reallocate the word like swish-e 2.x
I rather prefer the second one.
If we do nothing, a core with the subsequent actions will occur.
> Now move over to search.c:
>
> Here you take a search word, remove the "*", Stem it, then if needed,
> allocate some more space. Maybe I'm really confused, but it looks to
> me like you never assign the new pointer value to "word".
>
> if (indexf->header.applyStemmingRules)
> {
> /* apply stemming algorithm to the search term */
> i=strlen(word)-1;
> if(i && word[i]=='*') word[i]='\0';
> else i=0; /* No star */
> Stem(&word,&lenword);
> if(i)
> {
> tmp=emalloc(strlen(word)+2);
> strcpy(tmp,word);
> strcat(tmp,"*");
> efree(word);
> tmp=word;
> Here you created a new memory block in tmp, and then copied the
> contents of "word" into "tmp". Then free() is called on "word", but
> then word is copied to tmp? Now tmp points to free()d memory, no?
>
> Am I confused? You never set word to the new pointer value?
> Shouldn't it be
> word = tmp;
>
You are right. This is a bug!!!. It should be word=tmp;
>
> Now, back to stemmer.c. I'm not sure I understand the need to do all
> the memory allocation in stemmer.c, and I worry about having a
> constant called MAXWORDLEN, but allow words (and stemmed words) to be
> larger. Plus, it would seem faster not to have to allocate and free
> memory all the time.
>
> Do you see a reason to say that words can be longer than MAXWORDLEN?
> If not then why not just allocate variables on the stack that are
> MAXWORDLEN when needed? Stack variables are thread safe, right? It's
> a lot faster than asking the system for more memory all the time.
>
> In stemmer.c one could just check that the stemmed word was not longer
> than MAXWORDLEN and return the original word if it was (and return
> false as the original Stem() returned to indicate failure).
>
> Maybe I don't understand what it means to be thread-safe, but I can't
> imagine that stack variables are not thread safe.
>
Well, I do not use stemmer.c, and you know it much better. So you
are probably right; in this module it has nonsense doing these
things with word's buffers. I can go back to original stemmer.c
schema (using MAXWORDLEN) to avoid calling the allocation
routines.
cu
Jose
Received on Wed Nov 15 10:03:27 2000