
a better way to index?

From: Hup Chen <hup(at)not-real.addall.com>
Date: Fri Mar 07 2003 - 01:04:38 GMT
Hi,

I would suggest splitting the content of the entire database into several
smaller units, indexing each one separately, and then merging the pieces
into one index afterwards. This method is faster than indexing the whole
database directly in one pass. However, I found that smaller units are
faster and more effective only down to a certain chunk size; make the
chunks too small and the method starts to lose its effectiveness. My test
results suggest the new method can cut indexing time by as much as 50%
(Method 1 below took 193 minutes; the best Method 2 run took 98 minutes,
a saving of roughly 49%).

  I didn't test it with -S fs or -S http, so I'm not sure whether indexing
performance can be improved in those modes as well.
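
For illustration, here is roughly what the splitting step could look like.
This is only a sketch, not our production script: it assumes the database
dump is a flat file with one record per line, and the file name and chunk
size are made up.

# Hypothetical split step: break one flat dump into chunk files of
# $chunk_size records each, assuming one record per line.
use strict;
use warnings;

my $source     = 'books.dump';  # assumed dump file name
my $chunk_size = 100_000;       # 100K records was the sweet spot in my tests

open my $in, '<', $source or die "open $source: $!";
my ($n, $chunk, $out) = (0, 0, undef);
while (my $record = <$in>) {
    # start a new chunk file every $chunk_size records
    if ($n++ % $chunk_size == 0) {
        close $out if $out;
        $chunk++;
        open $out, '>', "$source.$chunk" or die "open $source.$chunk: $!";
    }
    print {$out} $record;
}
close $out if $out;
close $in;
print "wrote $chunk chunk files\n";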


Test data:
Live data from our book database: 2M+ records (2,061,722), average record
size about 500 bytes.
=> 2061722 files indexed.  1173167897 total bytes.  133925121 total words.



Hardware:
4 GB RAM (no swapping), dual 2.4 GHz Xeon CPUs, a dedicated Linux index
server.



Swish-e version: 2.2.3



Swish-e configuration file: only 2 lines, no IgnoreWords.

MetaNames title author price comment
PropertyNames title author price usprice comment dealer currency url



Method 1:
index the whole 2M-record database in one pass.

# $parse $source | $swishe -i stdin -S prog -f $pathOut/$test1 -c swish.conf
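
(Aside, for anyone reproducing this: with -S prog and -i stdin, swish-e
reads documents from standard input in its "prog" format, so $parse
presumably emits a header block plus body per record, roughly like the
following. The record name and body are invented for illustration;
Content-Length is the byte count of the body:

Path-Name: record-1
Content-Length: 32

<title>Example</title> body text
)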


Method 2:
split the 2M-record database into several smaller databases, index each
small database one by one, then merge all the smaller indexes into one big
index (a note on the -M arguments follows the loop).

foreach $i (@smaller) {
   # index one chunk into its own temporary index file
   `$parse $small.$i | $swishe -i stdin -S prog -f $pathOut/$test2.$i -c swish.conf`;
   # remember this chunk's index for the final merge
   $merge .= " $pathOut/$test2.$i ";
}
# merge all chunk indexes into the final index
`$swishe -M $merge $pathOut/$test2`;
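
For reference, if I remember the swish-e 2.x merge semantics right: -M
takes the input index files first and writes the merged result to the last
file named, so with three chunks the final command above would expand to
something like

$swishe -M $pathOut/$test2.1 $pathOut/$test2.2 $pathOut/$test2.3 $pathOut/$test2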


--------------------------------------------------------------------------------------------

Result:
Method 1: 3 hours 13 minutes
Start_time: Sat Feb 15 15:05:49 PST 2003
Indexing Data Source: "External-Program"
Indexing "stdin"
Removing very common words...
no words removed.
Writing main index...
Sorting words ...
Sorting 2996117 words alphabetically
Writing header ...
Writing index entries ...
   Writing word text:  80%
   Writing word text: Complete
   Writing word hash:  80%
   Writing word hash: Complete
   Writing word data:  79%
   Writing word data: Complete
2996117 unique words indexed.
Sorting property: twishlastmodified
Sorting property: comment
12 properties sorted.

2061722 files indexed.  1173167897 total bytes.  133925121 total words.
Elapsed time: 03:13:13 CPU time: 00:-2:-55
Indexing done!
End_time: Sat Feb 15 18:19:02 PST 2003


Method 2 (by chunk size, in records):
size 600K: 2 hours 30 minutes
size 400K: 2 hours 20 minutes
size 200K: 1 hour 50 minutes
size 100K: 1 hour 38 minutes
size 10K:  1 hour 43 minutes <- 10K chunks are too small for a 2M-record database


--
Hup
Technical Manager
AddALL.com

Search.  Compare.  Save.

http://www.addall.com  (new book search)
http://used.addall.com/  (used book search)
http://www.amusicarea.com  (CDs)
http://www.amoviearea.com  (DVD & VHS)
http://www.amagarea.com  (magazine subscriptions)