Re: PHRASE SEARCH (fwd)

From: Jose Manuel Ruiz <jmruiz(at)not-real.boe.es>
Date: Mon Apr 17 2000 - 10:01:39 GMT

Roy,


It sounds OK to me.

One month ago, I looked at swish-e with a view to using it for
indexing several databases. It was a small and very nice package,
well implemented, simple to modify and with some cgi scripts.
But it had no phrase searching ... and I need it.
I took a look at the code and thought it would be easy to
add this feature.

But when I began working on the code, I saw that not all
patches had been applied, that there were many memory
leaks, and that some of the code could be improved for
better performance and more portable storage.

Perhaps I am making too many modifications with only the advice
of Bill Moseley and SRE. So, it is time to stop and redefine
where we want to go.

Since my last message to this list, I have added the following
new features and patches:

- Better compression scheme. In my own test case I save up 
to 25 % of disk space. I think compress was using an unneeded 
extra null byte.

- More portable "DOC PROPERTIES". Now, the integers are 
stored in the portable format of the compress function. So, the
final index file is "more portable" between platforms. May 
I say "portable" instead of "more portable"?

- The compress function is now implemented as a macro, so 
indexing is faster.

- Sorting of results by properties (-s option). -s is for 
sort. Only descending sort is implemented, but you can use 
more than one property.
Making it ascending would be very simple if it is considered 
useful. The sort implementation is just based on qsort 
and strcasecmp. You can sort by date if you store it in
yyyymmdd format.

- Show a range of results (-b option), based on the patch from 
Scott Schultz. (-b is for begin). Combined with -m you
can get m results starting at b.

- More working in memory.
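
To illustrate the portable integer storage mentioned in the list
above, here is a minimal sketch of one common technique: a
variable-length encoding that writes 7 bits per byte, most
significant group first, so the result does not depend on the
platform's byte order or word size. This is an assumption about
the general approach, not the actual swish-e compress code; the
names put_varint and get_varint are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers for illustration; not swish-e functions.
 * Each output byte carries 7 bits of the value. The high bit is
 * set on every byte except the last, so no terminator is needed. */
size_t put_varint(unsigned long value, unsigned char *out)
{
    unsigned char tmp[sizeof(unsigned long) * 8 / 7 + 1];
    size_t n = 0, i;

    assert(out != NULL);

    /* Collect 7-bit groups, least significant first. */
    do {
        tmp[n++] = (unsigned char)(value & 0x7F);
        value >>= 7;
    } while (value != 0);

    /* Emit them most significant first, flagging continuation. */
    for (i = 0; i < n; i++) {
        unsigned char b = tmp[n - 1 - i];
        if (i + 1 < n)
            b |= 0x80;
        out[i] = b;
    }
    return n;   /* number of bytes written */
}

/* Decode one value; returns the number of bytes consumed. */
size_t get_varint(const unsigned char *in, unsigned long *value)
{
    size_t i = 0;
    unsigned long v = 0;

    assert(in != NULL && value != NULL);

    do {
        v = (v << 7) | (unsigned long)(in[i] & 0x7F);
    } while (in[i++] & 0x80);

    *value = v;
    return i;
}
```

With this scheme small numbers take a single byte, and the same
bytes decode to the same value on any platform.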

None of these are available yet because I am still testing 
them. Give me one day and I will make them available.
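
As a rough illustration of the -s sorting described in the list
above (descending order, several properties, qsort plus
strcasecmp), a sketch might look like the following. It is not
the code from the package: struct result, MAX_SORT_PROPS and
sort_results are hypothetical names, and every property is
assumed to be a non-NULL string.

```c
#include <assert.h>
#include <stdlib.h>
#include <strings.h>   /* strcasecmp (POSIX) */

#define MAX_SORT_PROPS 4   /* hypothetical limit for this sketch */

struct result {
    const char *filename;
    const char *props[MAX_SORT_PROPS];  /* values of the sort properties */
};

/* qsort passes no context argument, so the number of properties
 * to compare is kept in a file-scope variable. */
static int num_sort_props = 1;

/* Compare property by property, case-insensitively. The sign is
 * inverted so that larger values come first (descending sort). */
static int compare_desc(const void *a, const void *b)
{
    const struct result *ra = a;
    const struct result *rb = b;
    int i;

    for (i = 0; i < num_sort_props; i++) {
        int c = strcasecmp(ra->props[i], rb->props[i]);
        if (c != 0)
            return c < 0 ? 1 : -1;
    }
    return 0;
}

void sort_results(struct result *results, size_t count, int nprops)
{
    assert(nprops >= 1 && nprops <= MAX_SORT_PROPS);
    num_sort_props = nprops;
    qsort(results, count, sizeof *results, compare_desc);
}
```

Dates stored as yyyymmdd sort correctly here because, for
fixed-width digit strings, string order and chronological order
agree. Returning c < 0 ? -1 : 1 instead would give an ascending
sort.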

Some other patches to add:
- Filter patch from Rainer Scherg.

To do (this comes to my mind at this moment):
- Better "or". It is slow when long result lists are returned 
(this is the case for searches of the type a* or b* or c*...). 
I am not sure if it will be worth the effort.
- Better phrase search. Stopwords, colons etc.: what should we 
do with them? 
- A near operator? 
- An apache module? (This needs more work on memory. 
We must free ALL the memory used by each search). 
- A get_document function for highlighting words?
- All words are treated as strings in the index file. Can
we add a numeric data type to the index? This can help in 
searching with <, <=, =, >=, > for numbers and dates 
(stored as a number).

Waiting to hear from all of you...

Jose Ruiz
jmruiz@boe.es

Roy Tennant wrote:
> 
> I would like to propose the following:
> 
> 1) That those who have been active lately on upgrading SWISH-E for phrase
> searching, memory leaks, etc. settle on a limited set of additional
> enhancements to the phrase searching version of SWISH-E.
> 2) That a *coordinated* team of volunteers take on those limited set of
> enhancements (or it could be one brave person) and implement them.
> 3) That once those enhancements have been achieved the software is
> thoroughly tested by a wide range of users (us).
> 4) That upon fixing any bugs that this (phrase searching version plus any
> other enhancements that are folded in) become SWISH-E 2.0.
> 5) That manual updates and changes be submitted by those who added the
> capabilities (so someone like me who doesn't read C doesn't make a mess of
> it)
> 
> The key part of this to me is to settle on a *limited* set of enhancements
> to become 2.0, and not get into the same creeping featuritis that plagues
> the Mozilla effort. How does this sound? Thanks,
> Roy Tennant
> 
> ---------- Forwarded message ----------
> Date: Fri, 14 Apr 2000 08:06:28 -0700 (PDT)
> From: Jose Manuel Ruiz <jmruiz@boe.es>
> To: Multiple recipients of list <swish-e@sunsite.berkeley.edu>
> Subject: [SWISH-E] Re: PHRASE SEARCH
> 
> Yesterday, I added some patches to the code:
> 
> - notoperator (Bill Moseley)
> - metaStopWord (Adrian Mugnolo)
> - memoryleak and memoryleak2 (Richard Beebe and Marc Perrin)
> - spider (Steve van der Burg)
> - all getmeta patches (and improvements!!) (Tom Brown)
> 
> All of them will be available soon, in next package.
> 
> Not applied yet ...
> - paging (Scott Schultz).
> - filters (Rainer Scherg).
> - language (Adrian Mugnolo).
> 
> These are the patches I found in
> http://sunsite.berkeley.edu/SWISH-E/Patches
> 
> SRE wrote:
> >
> > At 06:32 AM 4/13/00 -0700, Bill Moseley wrote:
> > >Did you look at the http://sunsite.berkeley.edu/SWISH-E/Patches/ directory
> > >for any patches that could/should be included?  I'm not sure if SRE
> > >included any in his version or not, other than the stemmer.c bug.
> >
> > I didn't include anything from the Patches directory, which
> > never occurred to me, but now I'm wishing I had! The phrase
> > search version would be a good place to correct that mistake.
> >
> > SRE
> >
> > mailto:eckert(at)not-real.climber.org | http://www.climber.org/eckert/
> > Info on peak climbing email lists mailto:info@climber.org
> >
> > "I know God will not give me anything I can't handle.
> >  I just wish that He didn't trust me so much."
> >      -- Mother Teresa
> 
> --
> 
> Jose Manuel Ruiz Ramos
> 
> jmruiz@boe.es
Received on Mon Apr 17 06:03:18 2000