RE: lucene/plucene

From: Alexander Korth <alexander.korth(at)not-real.dai-labor.de>
Date: Mon May 09 2005 - 16:34:32 GMT
Hi,

in reply, I would like to share some experimental results that may be of interest to users planning to address large document collections.

I applied both Swish-E and Lucene in an IR project to support a set of self-developed indexing/search agents. Incoming requests are delegated by several filter managers to a particular filter agent using coordination and cooperation strategies that take into account not only the functionality provided (phrase search, BEs) but also system/HDD/DB load and user ratings of earlier results. The underlying collection consists of about 750,000 documents (~7 GB).

Comparing result sets:
Using Swish-E with standard options (e.g. without -R), I ran some tests intersecting the document sets returned for several queries. To do so, I sent the same query to both Swish-E and Lucene, cropped each result set to its best 50 documents and intersected the two. The document collection was about 210,000 documents (~2 GB). For single-term queries, 30% of the documents returned by Lucene and Swish-E were common to both. That value of course decreased slightly when more terms were added to an OR'ed query, but it was still 9% with 25 terms, which is impressive. ANDing terms gave no better results, because the individual terms were selected arbitrarily from a DB and simply did not appear together in any document. A value higher than 30% can certainly be achieved by forming smarter, more restrictive queries, which yield a much smaller initial document set before cropping and hence a larger intersection.
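
The comparison described above can be sketched roughly as follows. This is only an illustration, not the code used in the tests: it assumes each engine has already returned a ranked list of document identifiers (best first), and the function name top_k_overlap and the choice of k as the denominator are my own assumptions.

```python
def top_k_overlap(results_a, results_b, k=50):
    """Fraction of the top-k documents that both ranked lists share.

    results_a, results_b: ranked lists of document ids, best hit first.
    The denominator is k (an assumption); if a list has fewer than k
    entries, the missing slots simply count as non-matching.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_a = set(results_a[:k])  # crop to the best k docs of engine A
    top_b = set(results_b[:k])  # crop to the best k docs of engine B
    return len(top_a & top_b) / k


if __name__ == "__main__":
    # Hypothetical document ids, for illustration only.
    lucene_hits = ["d1", "d2", "d3", "d4"]
    swish_hits = ["d3", "d4", "d5", "d6"]
    print(top_k_overlap(lucene_hits, swish_hits, k=4))
```

With k=50, a return value of 0.3 would correspond to the 30% overlap reported for single-term queries.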

Query time:
Using queries with 1-250 terms (AND and OR, respectively), I made the following observations (time to launch the JRE not included):
- time spent on 1-term query: Lucene 500ms, Swish-E 800ms
- time spent on 25-term OR query: Lucene 1200ms, Swish-E 1200ms
- time spent on 25-term AND query: Lucene 750ms, Swish-E 780ms
- time spent on 250-term OR query: Lucene 4400ms, Swish-E 8750ms
(system: Linux, 4x Opteron 850 2.4GHz, 16GB RAM)
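
For anyone who wants to repeat such measurements, a minimal timing harness might look like the sketch below. It is not the setup used for the numbers above. The `search` argument is a hypothetical callable that wraps whichever engine is being tested, the helper names are made up, and joining terms with " OR " is an assumption about the query syntax. Timing only the call itself keeps startup costs (such as launching the JRE) out of the result.

```python
import statistics
import time


def build_or_query(terms):
    """Join terms into one OR'ed query string (syntax is an assumption)."""
    return " OR ".join(terms)


def time_query_ms(search, query, repeats=5):
    """Median wall-clock time in milliseconds of search(query).

    search: hypothetical callable wrapping the engine under test.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        search(query)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


if __name__ == "__main__":
    # Stand-in for a real engine call, for illustration only.
    def fake_search(query):
        return [term for term in query.split(" OR ")]

    query = build_or_query(["term%d" % i for i in range(25)])
    print("%.3f ms" % time_query_ms(fake_search, query))
```

Taking the median over several runs is a design choice to reduce the influence of caching and other one-off effects.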

Index size and indexing time:
I did not find any significant differences here. Both are quick and produce very small index files (~15% of the original size).

I hope some of you find this information useful. Thanks to the developers of this great tool.

Regards,
Alex

Dipl.-Inform. Alexander Korth
DAI-Labor - Technische Universität Berlin
Sekretariat GOR 1-1, Franklinstraße 28/29, 10587 Berlin 
Fon: +49 30 314 25318
Fax: +49 30 314 21799
alexander.korth@dai-labor.de
http://www.dai-labor.de

 
Received on Mon May 9 09:34:37 2005