Re: Use of swish-e in BaBar high energy physics exp.

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Tue Apr 15 2003 - 21:16:59 GMT
On Tue, 15 Apr 2003, Douglas Smith wrote:

> Just a note of thanks, from the BaBar high energy physics
> experiment at the Stanford Linear Accelerator Center.  After
> several months of testing various search engine tech's, and
> getting into discussion with Inktomi and Google, we have
> found something which is making people happy: swish-e.

Thanks, good.

> It also proved to be so much
> faster than other engines we have been able to up the update
> time to every 15mins (and it could probably handle every 
> 5mins),

Could you post the output from indexing some time and note what hardware
you are running on, and perhaps memory usage?

> There are a few issues in getting this to work that have
> come up:
> 
> 1. The system is almost too flexible to start, perhaps there
> should be a simpler install to get started?

Are you talking about swish-e binary or the swish.cgi script?

> 2. Is there a way people can influence the ranking of pages?
> Like through a meta tag there could be a ranking factor included?
> This was noticed that it would be nice to make initial posting
> to a discussion forum more important in ranking than replies.
> Also in search the web it would be nice for people to be able
> to decide what was the most important page, and include a meta
> tag to increase that page in ranking.  I think I saw this is
> discussion in 2.4 features perhaps, but I am not sure if I
> saw it the 2.2.3 features.

Well it hasn't happened yet.  It's somewhat hard because a metatag might
come along after words have already been indexed.  The docs are indexed in
a stream as they are parsed.  But if the meta tag is in the HEAD then it
might work ok.  Rank is calculated at search time.  It probably could be
calculated at indexing time and adjusted at search time.  Or there could
be a special property added to each file that is a bias on a file's
overall rank.  Those are all fairly big design changes -- swish is fast
because it tries hard not to do too much work... ;)
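
To make the "bias property" idea concrete, here is a rough sketch in Python.
It is not swish-e code: the Hit class, the rank_bias field, and apply_bias()
are all hypothetical names made up for illustration. It assumes each file
carries a bias stored at indexing time, and that the rank computed at search
time is simply multiplied by that bias before sorting.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    path: str
    raw_rank: float          # rank as computed at search time
    rank_bias: float = 1.0   # hypothetical per-file property set at index time


def apply_bias(hits):
    """Return hits sorted by raw_rank * rank_bias, highest first."""
    return sorted(hits, key=lambda h: h.raw_rank * h.rank_bias, reverse=True)


hits = [
    Hit("forum/reply-17.html", raw_rank=900.0),
    Hit("forum/topic-3.html", raw_rank=700.0, rank_bias=1.5),
]
ordered = apply_bias(hits)
```

With these made-up numbers the initial posting (700 * 1.5) sorts ahead of the
reply (900 * 1.0), which is the forum behavior asked about above.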

> 3. The ranking on word in context work very well, and people so
> far are able to get the correct page back quickly.

Seems to work ok for searches that return a small set of results.

> But is 
> there more work being done on ranking?  Like influencing the
> ranking by number of other pages in a search that link to that
> page.  I mean if a lot of pages in a search rank link to one
> page, then that page should be considered higher in rank.

You mean like Google's PageRank?  There is no tracking of links like that in
swish.  The database is not set up so that a page's rank can be adjusted
based on how many other pages link to it.

But otherwise, YES, the ranking code needs work.  It's very basic code
right now.

> 5. To update the site quickly I created a second inc. update
> index.  This also gets merged into the larger index in a 
> periodic manner, and after this merge the index doesn't exist
> until the next update.  This would produce an error with your
> current swish.cgi and swish-e search, since I am asking for
> an index which doesn't exist.  I changed the swish.cgi such
> that it accepts multiple indexes for searching, but checks 
> on their existence before passing the list onto the swish-e
> executable, so there is no error.  Is this a feature which
> perhaps should be added to the swish.cgi or even swish-e, if
> there are multiple index for search, ignore one if it is not
> there instead of producing an error?

There is code now to add files to an existing index.
It has not been tested much, but would work well for archives.  Grab CVS
or a swish-daily and run ./configure --help

So you have something like:

   full_index + incremental_index

and once in a while you merge the incremental_index into the full_index
and then incremental_index no longer exists.  Is that the problem?
I guess I'd stat the file, too.  Or maybe create a dummy index for the
incremental_index with a small file entry with one word.  Stat()ing is
probably faster.
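
The stat approach could look something like the sketch below. It is written
in Python for illustration, not taken from swish.cgi; existing_indexes() and
build_search_command() are hypothetical helper names, and the index file
names are placeholders. The only swish-e specifics it relies on are the -w
(search words) and -f (one or more index files) switches.

```python
import os


def existing_indexes(index_paths):
    """Keep only the index files that can be stat()ed right now."""
    found = []
    for path in index_paths:
        try:
            os.stat(path)
        except OSError:
            # Missing (for example, just merged away) or unreadable: skip it.
            continue
        found.append(path)
    return found


def build_search_command(query, index_paths, swish_binary="swish-e"):
    """Build the argument list for a search over the indexes that exist."""
    indexes = existing_indexes(index_paths)
    if not indexes:
        raise FileNotFoundError("no searchable index files found")
    # -w gives the search words, -f lists one or more index files.
    return [swish_binary, "-w", query, "-f", *indexes]
```

A caller would pass something like ["full_index", "incremental_index"] and
hand the returned list to whatever runs the search.  There is still a small
window between the stat and the search where the incremental index can
disappear, so the caller should still cope with an error from swish-e.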

Cheers,

-- 
Bill Moseley moseley@hank.org
Received on Tue Apr 15 21:18:01 2003