Re: [swish-e] parallelism and Swish-e

From: Peter Karman <peter(at)not-real.peknet.com>
Date: Sat Mar 14 2009 - 03:08:38 GMT
Andrew Smith wrote on 3/13/09 4:20 PM:
> Hi,
> 
> I'm using the latest version of Swish-e and I have it working fine,
> but I am wondering if and how Swish-e has any support for parallelism
> and multiprocessors, in particular both for indexing and searching.

In short, there is no built-in support for either.

A few years ago someone worked up a search cluster manager:
http://swishd.sourceforge.net/

I've not used it myself. It appears to have been abandoned.


> For indexing, I could just handle it myself via the prog input method
> (i.e. just fork parallel processes which each independently index part
> of a directory tree, e.g. each process is given a number N and indexes
> 1/Nth of the documents). Then I could merge the indexes at the end (or
> just pass them all to Swish-e using the -f option when searching). But
> it would be easier if I could just do this via the simple file system
> index method; is there any configuration option where you can specify
> that Swish-e only indexes every Nth file it encounters?

No, there is no such configuration option.
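
If you go the do-it-yourself route, the "1/Nth of the documents" split could
look roughly like the sketch below. This is only an illustration and not part
of Swish-e: shard_for, files_for_shard and walk_files are hypothetical helper
names, and using a CRC32 of the path to pick the shard is my assumption. Any
stable function of the path would do.

```python
import os
import zlib


def shard_for(path: str, n_shards: int) -> int:
    """Map a path to a shard number in [0, n_shards) using a stable checksum.

    Hypothetical helper: CRC32 is an arbitrary choice, picked because it
    gives the same answer on every run and in every process.
    """
    return zlib.crc32(path.encode("utf-8")) % n_shards


def files_for_shard(paths, n_shards: int, shard: int):
    """Return the subset of paths that belongs to the given shard."""
    return [p for p in paths if shard_for(p, n_shards) == shard]


def walk_files(root: str):
    """Yield every file path under root in a deterministic (sorted) order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order repeatable
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)
```

Each forked worker would call files_for_shard(list(walk_files(root)), N, k)
with its own k and hand only those files to its prog-method program. Because
the shard depends only on the path, every file lands in exactly one index and
stays in the same index when you reindex.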

> 
> Next, for searching can Swish-e take advantage of parallelism? For
> example, does it know it is running on a multiprocessor and internally
> execute the search in parallel? If not, again, I could conceivably
> handle this myself as follows. If I want to search in parallel on,
> say, 8 processors I would create 8 separate indexes as above, each
> covering 1/8th of the files in the corpus of documents to be searched.
> Then when searching I fork 8 processes where each one independently
> searches one of the 8 separate indexes. Finally, I collate the results
> of each of these 8 parallel searches into one final result set. Would
> this work? Or would it somehow screw up relevance ranking since the
> indexes are being searched independently?

The latter: it would distort the relevance ranking. The rank each search
reports is scaled to a 1000 baseline rather than being a raw rank score, so
you wouldn't be able to reliably interleave the results of independent
searches.
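
A toy illustration of the problem, not Swish-e's actual ranking code. It
assumes the scaling is a simple linear one in which the best hit of each
search gets 1000; the raw scores and document names are made up, and
scale_to_1000 and merge_by_rank are hypothetical helpers.

```python
def scale_to_1000(raw_scores):
    """Scale one search's raw scores so that its best hit gets 1000.

    Assumption: plain linear scaling against the top hit of that search.
    """
    top = max(raw_scores.values())
    return {doc: round(1000 * score / top) for doc, score in raw_scores.items()}


def merge_by_rank(*result_sets):
    """Interleave several result sets by reported rank, best first."""
    merged = {}
    for results in result_sets:
        merged.update(results)
    # Break ties on the document name so the order is deterministic.
    return sorted(merged, key=lambda doc: (-merged[doc], doc))


# Made-up raw scores: index A holds strong matches, index B only weak ones.
raw_a = {"a1": 50.0, "a2": 25.0}
raw_b = {"b1": 5.0, "b2": 4.0}

# What each independent search would report after scaling.
scaled_a = scale_to_1000(raw_a)  # a1 -> 1000, a2 -> 500
scaled_b = scale_to_1000(raw_b)  # b1 -> 1000, b2 -> 800

by_reported_rank = merge_by_rank(scaled_a, scaled_b)
by_raw_score = merge_by_rank(raw_a, raw_b)
```

Here by_raw_score puts a2 (raw 25) ahead of b2 (raw 4), but by_reported_rank
puts b2 (800) ahead of a2 (500), because each search was scaled against its
own best hit.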

Swish-e's architecture was never designed to scale the way you are describing.
You might still be able to take the approach you describe and use multiple
indexes. I know that some folks have done that simply to handle
multi-million-document collections.

OTOH, you might look at Swish3, since the Xapian backend can scale for
distributed searching[1].

[1] http://xapian.org/docs/remote.html




-- 
Peter Karman  .  http://peknet.com/  .  peter(at)not-real.peknet.com
Received on Fri Mar 13 23:08:32 2009