Re: Anyone using the C library or SWISH.pm module?

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Oct 03 2002 - 13:46:12 GMT

On Thu, 3 Oct 2002, Alex Lyons wrote:

> When searching multiple indexes there doesn't seem to be an interface
> to provide the index number or a pointer to the index that each result
> entry came from.  This would be Really Useful so that I could look up
> the index metainfo for each result (IndexName, IndexPointer, etc)
> which I would like to factor into the generated results display.

This is one of the areas I wanted to improve.  Any input you can provide
would be helpful.

With the swish-e binary the index headers are printed (one set per index)
before displaying the results.  In the library you should have access to
the headers after opening the index files (even before running a search).

So some ideas I'll throw out -- comments?

      my $sh = SWISHE->new( $index1, $index2 );
      my $headers = $sh->headers;

      print $headers->{$index1}{WordCharacters};

or maybe more flexible:

      my $headers = $sh->headers( $index1 );
      print $headers->{WordCharacters};

Or even OO style (seems overkill for something that really is a hash):

     print $headers->WordCharacters;


So with a result:

     my $results = $sh->search( $query_string );

     while ( my $result = $results->next ) {
        # 'swishdbfile' names the index file this result came from
        my $index = $result->property('swishdbfile');
        my $wc = $headers->{$index}{WordCharacters};
     }

Or maybe more natural:

       my $wc = $results->header( 'WordCharacters' );


The idea of the redesign is that when $results goes out of scope, perl
(in the XS code) will automatically free all the memory used by that
search.  That should really simplify the perl API.
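
To illustrate the scope-based cleanup, here is a minimal pure-Perl sketch.
The My::Results class is a hypothetical stand-in, not the real module; the
assumption is that the XS version would release the C search structures
from its DESTROY method in the same way.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the results object (assumption: the real one
# would live in XS and free the C search structures in DESTROY).
package My::Results;

our $freed = 0;    # counts how many result sets have been released

sub new {
    my ( $class, @hits ) = @_;
    return bless { hits => [@hits] }, $class;
}

# Return the next hit, or undef when the list is exhausted.
sub next {
    my $self = shift;
    return shift @{ $self->{hits} };
}

# Perl calls DESTROY when the last reference to the object goes away.
sub DESTROY {
    $freed++;
}

package main;

{
    my $results = My::Results->new( 'a.html', 'b.html' );
    while ( defined( my $result = $results->next ) ) {
        print "hit: $result\n";
    }
}    # $results goes out of scope here, so DESTROY runs

print "result sets freed: $My::Results::freed\n";
```

The caller never calls an explicit "free" or "close" method; leaving the
block is enough.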

You should also be able to do something like:

     my $sh = SWISHE->new( $index );
     my $search = $sh->NewSearch;

     my $results1 = $search->query( $query1 );
     my $results2 = $search->query( $query2 );

Not sure why you would want to have more than one results set at a time
(no, there's no way to AND or OR those sets at this time).

But it does mean you can prepare a search (which includes the query,
HTML structure, -L limit params, and sort order) and run searches using
that prepared search object.
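
A rough sketch of that pattern, using a hypothetical pure-Perl mock
(Mock::Search and its fields are made up for illustration and are not the
proposed API): the settings are stored once on the search object and each
query returns its own independent result set.

```perl
use strict;
use warnings;

# Hypothetical mock; none of this is the real module.
package Mock::Search;

sub new {
    my ( $class, %settings ) = @_;
    # Settings are fixed once, when the search object is prepared.
    return bless { sort => 'swishrank desc', %settings }, $class;
}

# Each query reuses the stored settings and returns a separate result set.
sub query {
    my ( $self, $query ) = @_;
    return { query => $query, sort => $self->{sort} };
}

package main;

my $search   = Mock::Search->new( sort => 'swishtitle asc' );
my $results1 = $search->query('foo');
my $results2 = $search->query('bar');

print "$_->{query} sorted by $_->{sort}\n" for $results1, $results2;
```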


> How about 2 C libraries: one containing just the stuff required to do
> a search, the other containing all the parsers and stuff needed to
> generate the index.  Then link swish-e to both, but only link
> swish-search to the search lib.  This would give you a "standard"
> swish-e as a general indexer and search tool, and a lightweight
> swish-search for search-only use in CGI.

Yep, that's basically done.  The library size is now about 1MB instead of
slightly more than 2MB.  Right now I'm building two libraries -- one has
all the code and one has mostly search code.


The code is not completely separated - for example there's docprop.c which
is mostly for fetching doc properties, but also includes a small amount of
code for encoding a property before it's written to disk.  We can separate
more as time goes on, but I'm not too concerned about a few 100K bytes here
and there.  Most machines should share that code.

I'd like to use libtool and build a .so library for swish, too.  I don't
know libtool, so it's a bit of time to make that change.

> Separate Perl APIs for each library, so those using mod_perl for
> searching wouldn't be loading all the indexing stuff into their httpd
> address space.  Maybe don't even need the Perl API for the indexing
> library unless you're planning to rewrite the Perl spiders to generate
> the index directly rather than piping to swish-e -S prog

Again, that's done.  The search library should no longer have libxml2
loaded -- that helps.

I don't really see a need for a library interface to the indexing code.  



-- 
Bill Moseley moseley@hank.org
Received on Thu Oct 3 13:53:58 2002