Re: Search on metanames - internals and speed

From: Bill Moseley <moseley(at)not-real.hank.org>
Date: Thu Mar 31 2005 - 19:21:41 GMT

On Thu, Mar 31, 2005 at 01:57:21PM -0500, Brett Paden wrote:
> Is the index arranged something like:
> 
> word | in_metaname1 | in_metaname2  |  ... |  <doc_list> 

Kind of.  There's also position information used for phrase matching
and to determine the "structure" of each word position
(title/bold/heading etc.).
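
A rough sketch of what one such entry might carry, in Python.  The
names here (Posting, IN_TITLE, phrase_match) and the flag values are
made up for illustration and are not swish-e's real index format:

```python
from dataclasses import dataclass, field

# Hypothetical structure flags; the real bit values in swish-e may differ.
IN_TITLE = 1
IN_BOLD = 2
IN_HEADING = 4


@dataclass
class Posting:
    """One word's occurrences in one document."""
    doc_id: int
    # (word position, structure flags) pairs
    positions: list = field(default_factory=list)


def phrase_match(first: Posting, second: Posting) -> bool:
    """True if `second` appears immediately after `first` in the same document."""
    if first.doc_id != second.doc_id:
        return False
    second_positions = {pos for pos, _ in second.positions}
    return any(pos + 1 in second_positions for pos, _ in first.positions)


bill = Posting(doc_id=7, positions=[(3, IN_TITLE), (40, 0)])
clinton = Posting(doc_id=7, positions=[(4, IN_TITLE)])
```

The position is what makes the phrase "bill clinton" distinguishable
from the two words merely appearing in the same document.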

> So queries that use metanames look up the entries by words, but only 
> return them if that entry is also marked as having been indexed under a 
> certain metaname?

Again, EVERYTHING is a metaname search.

   swish-e -w foo

is the same as:

   swish-e -w swishdefault=foo
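
One way to picture that, as a toy Python sketch: every lookup key is a
(metaname, word) pair, and a bare word just gets the default metaname.
The dict layout and the function names are assumptions for
illustration, not how swish-e stores things:

```python
DEFAULT_METANAME = "swishdefault"


def normalize_term(term: str) -> tuple:
    """Split 'meta=word' into (meta, word); a bare word gets the default metaname."""
    meta, sep, word = term.partition("=")
    if not sep:
        return (DEFAULT_METANAME, term)
    return (meta, word)


# Toy index keyed by (metaname, word) -> set of document ids.
index = {
    ("swishdefault", "foo"): {1, 2, 5},
    ("owner_id", "foo"): {2},
}


def lookup(term: str) -> set:
    """Look up one query term; there is no code path without a metaname."""
    return index.get(normalize_term(term), set())
```

Here lookup("foo") and lookup("swishdefault=foo") hit the same key.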


> So, theoretically, if I stored my owner_id metaname as a property, 
> then asked swish to return all the results that matched 'america or 
> clinton' along with the owner_id property, pushed them into some sort 
> of hash keyed by the owner_id property ... I might be able to improve 
> performance (or at least reduce the 'slow' part of the swish query).

You mean instead of this:

   swish-e -w '(america OR clinton) AND owner_id=foo'

you do:

   swish-e -w '(america OR clinton)'

then filter those results where owner_id is "foo"?

I'll bet swish will be faster unless you are searching a very large
index and america OR clinton returns only a few results.
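
A toy comparison of the two approaches in Python.  The data and the
names (word_index, meta_index, property_table) are invented, and the
lookup counter only stands in for property reads from disk:

```python
# Toy data: word -> doc ids, metaname entries, and a separate property table.
word_index = {
    "america": {1, 2, 3, 4},
    "clinton": {3, 4, 5, 6},
}
meta_index = {("owner_id", "foo"): {2, 5, 9}}
property_table = {1: "bar", 2: "foo", 3: "bar", 4: "baz", 5: "foo", 6: "bar", 9: "foo"}


def search_in_index() -> set:
    """(america OR clinton) AND owner_id=foo, using set operations only."""
    words = word_index["america"] | word_index["clinton"]
    return words & meta_index[("owner_id", "foo")]


def search_then_filter() -> tuple:
    """(america OR clinton), then one property lookup per result."""
    candidates = word_index["america"] | word_index["clinton"]
    lookups = 0
    matches = set()
    for doc_id in candidates:
        lookups += 1  # stands in for reading a property from disk
        if property_table[doc_id] == "foo":
            matches.add(doc_id)
    return matches, lookups
```

Both return the same documents, but the second does one property
lookup for every candidate, including the ones it throws away.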

> Are property lookups expensive?

They are not free.  Searching doesn't touch the property table.  Even
sorting doesn't touch the property table (because swish has pre-sorted
indexes created at indexing time for this purpose) until returning
results.  (The exception is searching more than one index at the
same time: then it has to read the property table when sorting results
by anything other than an internal property such as rank.)
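
The pre-sorted index idea, sketched in Python under the assumption
that it amounts to a per-document sort position computed once at
indexing time (the names and layout are illustrative only):

```python
# Index time: property values are read once to build a sort-order table.
properties = {1: "zebra", 2: "apple", 3: "mango"}  # doc_id -> property value

# doc_id -> position in sorted order, so later sorts only compare integers.
presorted = {
    doc_id: order
    for order, doc_id in enumerate(sorted(properties, key=properties.get))
}


def sort_results(doc_ids):
    """Search time: order results without reading any property values."""
    return sorted(doc_ids, key=presorted.__getitem__)
```

A table like this is only meaningful within one index, which is
consistent with the multi-index case needing the real property values.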

> Say my query returns 10,000 results, as 
> I start iterating through them one out of 10 contains an owner_id I am 
> interested in, so to get ten "real" results I'll have to pull 100 full 
> property/document lookups off the disk.  Bad?

You have to know the owner_id at some point.

You might hope that swish does this:

   Say you have an index with 1/2 million documents.

   swish-e -w america and clinton

   1) swish first finds all records with "america" and finds 10,000
      then,
      ** doesn't work like this **
   2) swish looks in just those 10,000 to see which has clinton.

Doesn't work that way.  On number 2 swish looks for "clinton" as a new
query -- it has no way to somehow "limit" the search to just the
existing results.  For one thing, swish doesn't know what it's supposed
to do with each result yet (OR or AND), and it doesn't know if it's
also a phrase search, either.  That's all done when two result sets are
combined.
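
In Python terms, a minimal sketch of that evaluation order, assuming a
query is a nested ("and"/"or", left, right) tuple.  The tuple form and
the function names are hypothetical, not swish-e's parser:

```python
index = {"america": {1, 2, 3}, "clinton": {2, 3, 4, 5}}
lookups = []  # records every word looked up in the full index


def find(word: str) -> set:
    """Each word is a fresh lookup against the whole index."""
    lookups.append(word)
    return index.get(word, set())


def evaluate(query) -> set:
    """Evaluate a word or an (op, left, right) tuple."""
    if isinstance(query, str):
        return find(query)
    op, left, right = query
    left_set, right_set = evaluate(left), evaluate(right)
    # The operator is only applied here, once both result sets exist.
    if op == "and":
        return left_set & right_set
    return left_set | right_set
```

The lookup for "clinton" never sees the "america" results; AND versus
OR only matters at the point where the two sets are combined.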

> > > Also, I've noticed that repeating the query speeds results 
> > > dramatically.
> > 
> > You are running the swish-e binary or the C/Perl interface?
> 
> Both.  The behavior I described above, however, has to do with
> command line tests using the binary.

It's common to see that kind of speed-up with anything, then.  Start
mozilla and exit.  Start it again.


-- 
Bill Moseley
moseley@hank.org

Received on Thu Mar 31 11:21:41 2005