Observations, problems, etc......

From: Net Virtual Mailing Lists <mailinglists(at)>
Date: Thu Aug 11 2005 - 06:21:37 GMT

I've spent the last month or so (since I subscribed to the mailing list)
trying to replace my current Postgres-based search engine with Swish-E.
There are a number of observations I'd like to make, constructively, in
the hope of making it a better product.  Please keep in mind that I am
not a C programmer; otherwise I would likely have tried to solve some of
these myself (and in some cases I did, sort of).  This is going to be a
long email, and I want to apologize in advance for that.

In no particular order, here they are:

#1. I cannot seem to find a PHP module that works with Swish-E.  I have
tried to use <>, but this just
does not compile on FreeBSD.  I get to the point where I am re-compiling
PHP and start getting errors like: /opt2/netvirtu/php-4.3.7/ext/swishe/
swishe.c:31: syntax error before `*'.  Additionally, I had to use
build_conf --force, because I am not running PHP from CVS (nor can I
imagine ever doing so).  It's not clear to me from the documentation
whether this is what I should have done; perhaps that is where my
problem originates.

I know this is probably not the place to discuss problems with this
particular module and I don't mean to do so.  What I mean to say is: an
"official" PHP module would be a welcome addition.  Is there something I
am missing somewhere here?

#2. The Perl API seems to be quite robust, but it looks like the only
real way to interface it to my PHP script is through system commands,
and if I'm going to do that, it seems better to just call the swish-e
executable directly rather than deal with the overhead of the Perl
mothership.  Because it is so command-line oriented, I'm not quite sure
how to effectively deal with issues of command-line length limits (I
could probably figure it out if I can get this to work, but I haven't
been able to yet).  How can I possibly be hitting these limits, you ask?
The reason is that one of my requirements is to search within a zip code
radius.  So first I compile a list of all the
zip codes and then do a (god awful) -w 'zip_code=(11111 OR 22222 OR 33333
OR ...)'.  I'm not even sure what sort of performance implication this
would have, because I can't get it to work (more about this later).  I
don't know what I'm asking for here... An internal ZipCode datatype?...
Perhaps passing in a latitude, longitude, and radius and having it return
records that fall within it?  I am probably mentioning two subjects in
one here.
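
The zip code radius search in #2 could be sketched roughly as follows.
This is only an illustration, written in Python rather than PHP, and it
rests on assumptions that are not from my setup: ZIP_COORDS is a made-up
in-memory table of zip code to (latitude, longitude), the helper names
are hypothetical, and the 4000-character limit is an arbitrary stand-in
for whatever the real command-line limit is.  It selects the zip codes
inside a radius with the haversine formula, then splits the OR list into
several -w query strings so that no single one exceeds the limit.

```python
import math

# Hypothetical sample data: zip code -> (latitude, longitude) in degrees.
ZIP_COORDS = {
    "94132": (37.72, -122.48),
    "94112": (37.72, -122.44),
    "10001": (40.75, -73.99),
}

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def zips_within(lat, lon, radius_miles, coords=ZIP_COORDS):
    """Return the sorted zip codes whose coordinates lie inside the radius."""
    return sorted(
        z for z, (zlat, zlon) in coords.items()
        if haversine_miles(lat, lon, zlat, zlon) <= radius_miles
    )


def format_query(meta, zips):
    """Build a 'meta=(A OR B OR ...)' search string."""
    return f"{meta}=({' OR '.join(zips)})"


def build_queries(zips, meta="zip_code", max_len=4000):
    """Split zips into query strings that are each at most max_len long.

    A single zip code that alone exceeds max_len still gets its own query.
    """
    queries, current = [], []
    for z in zips:
        candidate = current + [z]
        if current and len(format_query(meta, candidate)) > max_len:
            queries.append(format_query(meta, current))
            current = [z]
        else:
            current = candidate
    if current:
        queries.append(format_query(meta, current))
    return queries
```

Each string returned by build_queries would be passed as one -w argument
in a separate swish-e run, and the caller would have to merge (and
re-sort) the result lists itself, which is part of why a built-in
latitude/longitude/radius search would be nicer.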

#3. It would be really nice if it were possible to just output all the
records without specifying a search of any sort.  The reason for this is
ease of integration.  It would be very nice to be able to go to Swish-E
all the time, without first having to decide whether it's appropriate to
fetch the data from the database system or from Swish-E and then having
to write separate queries for each.

#4. I mentioned this in a prior thread. What I'm really looking for is
some sort of summarization *inside* the search engine itself.  I'm not a
C programmer but I hacked together something to test my theory that this
can be done more efficiently inside Swish-E instead of crunching through
all the results with an external script.  The results were pretty clear.
By adding some summarization code to sort_single_index_results and
sort_results I was able to achieve almost no additional overhead for
this.  My code is a total hack-job and I'm not comfortable running it on
a production system.  But, I now know this can be done.  I really wish
someone could help implement this feature.

Basically what it is:

I have a PropertyName/MetaName called "category", of the form
"A.B.C.D".  Each item is assigned to a category (or multiple categories).
What I need is a count of the number of records that fall into each
category.  Here is example output from my code changes.  The first
result line is there because of the -m 1 (0 doesn't seem to work); what
follows it is the summarization my code generates:

su-2.05a$ /usr/local/bin/swish-e -m 1 -w 'unix'
# SWISH format: 2.4.3
# Search words: a*
# Removed stopwords: 
# Number of hits: 100
# Search time: 0.051 seconds
# Run time: 0.131 seconds
1000 296204 "296204" 2132 
a.d.c.b 1
a.d.c.d 1
a.d.c.e 1
a.d.c.f 2
a.d.c.g 3 1
a.e.c.e 1
a.e.b.c 2
a.e.b.h 1
a.b.e.b 2
a.b.c.e 1
a.b.c.j 1
a.b.b.b 2
a.b.b.e 7
a.b.b.f 1
a.b.b.g 18
a.b.b.h 23
a.c.f.c 1
a.f.b.b 4
a.c.e.c 1
a.c.e.e 1
a.c.c.d 3
a.c.c.e 3
a.c.c.f 1
a.c.b.c 3
a.c.b.j 15

This runs only hundredths of a second slower than before I added this
summarization code.  However, if I implement it using the Perl API, the
runtime climbs up to about 3 seconds (30 times longer) because the Perl
script has to process each result and build the summarization.  This is
on my intentionally very slow development system.  I am sure that much
better performance would be had on the production system, but it seems
like sort_results gets called and has to look at every result anyway.
If someone just wishes to get a summarization (group by?) back, this
seems like a reasonable place to implement it with little additional
overhead.  My tests, more or less, confirm that this is the more
efficient implementation.
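
For comparison, the external-script approach amounts to one pass over
every result.  A minimal sketch of that pass, in Python rather than the
Perl API, with a hypothetical helper name (summarize) and made-up sample
data, where each record is just the list of categories it belongs to:

```python
from collections import Counter


def summarize(records):
    """Count how many records fall into each category.

    `records` is an iterable of category lists, one list per record.
    A record is counted once per distinct category it belongs to.
    """
    counts = Counter()
    for categories in records:
        counts.update(set(categories))
    return counts


# Made-up sample data: three records, one of them in two categories.
sample = [
    ["a.b.b.g"],
    ["a.b.b.g", "a.c.b.j"],
    ["a.c.b.j"],
]

for category, count in sorted(summarize(sample).items()):
    print(category, count)
```

The work per result is small, but the script still has to fetch and
decode every hit before it can count anything, which is the cost that
doing the counting inside the sort functions avoids.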

I am happy to send my code to someone who would be interested in cleaning
it up and making it an official part of the Swish-E feature set.  The
problem is, I'm just not sure how to best implement it and I'm quite
certain the way I did it here is pretty bad (everything is hardcoded,
generated output is ambiguous, etc).

#5. Up until this point, I think, I've been talking mostly about "feature
requests".  But, all these considerations aside, I am now stuck.  I'm
hoping someone can help.  It seems I can't find records by searching a
MetaName for a numeric value:

su-2.05a$ /usr/local/bin/swish-e -m 1 -w 'zip=55'
# SWISH format: 2.4.3
# Search words: zip=55
# Removed stopwords: 
err: No search words specified

su-2.05a$ /usr/local/bin/swish-e -m 1 -w 'zip=a94132'
CUR: a
# SWISH format: 2.4.3
# Search words: zip=a55
# Removed stopwords: 
err: no results

The "CUR:" line is a debugging printf of cur_token->line that I added
around line 850 (the first example doesn't reach the printf because of
the if statement which precedes it):

            if ( isMetaNameOpNext(cur_token->next) )
                printf("CUR: %s\n", cur_token->line);

In the first case, when I specified a number, it doesn't seem to get past
isMetaNameOpNext.  I don't think the problem is here; rather, somewhere
the code seems to be removing numbers from the search words.  Any idea
why, or what I can do to prevent this?


- Greg
Received on Wed Aug 10 23:21:39 2005