> Judith Retief wrote on 11/20/07 3:54 AM:
>> However, my index files differ hugely in size: the merged index files add up
>> to about 80M, the incremental index files almost 600M! What's going on?
>
>> Is there anything that I could be doing wrong to be generating these huge
>> files?
> Without seeing examples of your configs and the merge commands you are
> running, it's hard to speculate.
>
> One guess is that at merge time duplicates are being tossed out. But
> that size difference seems too significant.
>
> IME, the index size varies a lot based on the number/size/compression
> of the properties I am storing.
> --
> Peter Karman . http://peknet.com/ . peter(at)not-real.peknet.com
It's not my intent to have other people debug my code, but if anyone is
willing to have a look at this to see if I'm doing anything ridiculously
wrong I'd appreciate it.
I use exactly the same config file for the two indexes. I index the same
content set for the two runs, and there are no duplicates.
This is the swish.config file:
===============================================
IndexContents XML* .xml
ParserWarnLevel 3
MetaNames content title body abstract attachment_data content.u_effective_date content.u_expiration_date content.u_updated_date
PropertyNamesDate content.u_effective_date content.u_expiration_date content.u_updated_date
PropCompressionLevel 9
MinWordLimit 3
UndefinedMetaTags index
UndefinedXMLAttributes index
IgnoreWords file: /home/cms/cass/indexer/stopwords/english.txt
IgnoreNumberChars 0123456789$.,
=================================================
Our app selects content items in batches of 50 from a database queue, and
for each batch it opens the swish pipe, indexes the 50 items, and closes the
pipe again.
The merging version looks like this (it's TCL code)
--------------------------------------------------
set swish_index [open "|swish-e -v3 -S prog -i stdin \
-c ./swish.config \
-f /tmp/temp.index" w]
Then, for each of the content items:
set id (get the data id from the database)
set data (read the data from the database)
set date [now]
set content_length [string length $data]
puts $swish_index "Path-Name: \'$id\'"
puts $swish_index "Content-Length: $content_length"
puts $swish_index "Last-Mtime: [clock format [clock scan $date] -format %s]"
puts $swish_index "Document-Type: XML*\n"
puts $swish_index $data
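One thing worth checking in the header code above: `string length` counts characters, while Content-Length is a byte count, so the two differ as soon as $data contains non-ASCII characters. A minimal sketch of a helper that measures the encoded bytes instead, assuming the data should be sent as UTF-8. The proc name `format_swish_doc` and its arguments are hypothetical, not part of our app or of swish-e:

```tcl
# Hypothetical helper (not from our app): build one document record for
# swish-e's "-S prog" input. Assumes the body should be sent as UTF-8.
proc format_swish_doc {id data mtime {update_mode ""}} {
    # Encode first, so that the length below is in bytes, not characters.
    set bytes [encoding convertto utf-8 $data]
    set headers [list "Path-Name: $id"]
    if {$update_mode ne ""} {
        lappend headers "Update-Mode: $update_mode"
    }
    lappend headers "Content-Length: [string length $bytes]"
    lappend headers "Last-Mtime: $mtime"
    lappend headers "Document-Type: XML*"
    # A blank line separates the headers from the document body.
    return "[join $headers \n]\n\n$bytes"
}
```

Because the record already holds encoded bytes, the pipe would need `fconfigure $swish_index -translation binary` and the record would be written with `puts -nonewline`, so that Tcl does not re-encode it or append a newline.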
And after indexing the set, we merge the master and temp to a temp merged
file, which is then the new master:
exec swish-e -M "/data/merge_index/index.swish-e" /tmp/temp.index /tmp/merged.index
exec mv /tmp/merged.index "/data/merge_index/index.swish-e"
exec mv /tmp/merged.index.prop "/data/merge_index/index.swish-e.prop"
exec mv /tmp/merged.index.btree "/data/merge_index/index.swish-e.btree"
exec mv /tmp/merged.index.array "/data/merge_index/index.swish-e.array"
exec mv /tmp/merged.index.file "/data/merge_index/index.swish-e.file"
exec mv /tmp/merged.index.psort "/data/merge_index/index.swish-e.psort"
exec mv /tmp/merged.index.wdata "/data/merge_index/index.swish-e.wdata"
exec rm /tmp/temp.index
(and remove the rest of the temp index files likewise)
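The list of mv calls above could also be written as a loop over whatever companion files the merge actually produced. This is only a sketch, assuming every file in the set shares the index path as a prefix (as the .prop, .btree, etc. files above do); the proc name `promote_index` is hypothetical:

```tcl
# Hypothetical helper: move an index and all of its companion files
# (index, index.prop, index.wdata, ...) from one base path to another.
proc promote_index {src dst} {
    set moved {}
    foreach f [lsort [glob -nocomplain -- $src $src.*]] {
        # Everything after the base path is the suffix ("" for the main file).
        set suffix [string range $f [string length $src] end]
        file rename -force -- $f $dst$suffix
        lappend moved $dst$suffix
    }
    return $moved
}
```

For example, `promote_index /tmp/merged.index /data/merge_index/index.swish-e` would replace the seven mv lines. The files are still moved one at a time, so a search running during the swap can see a mixed set, just as with the mv version.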
The incremental index version looks like this
---------------------------------------------
Firstly one has to create an initial index file set by indexing one item
without specifying Update-Mode (Update-Mode: Index assumes there's an
existing file):
set swish_index [open "|swish-e -v3 \
-S prog -i stdin \
-c ./swish.config \
-f /data/incremental_index/index.swish-e" w]
and then you index one item using:
puts $swish_index "Path-Name: \'$id\'"
puts $swish_index "Content-Length: $content_length"
puts $swish_index "Last-Mtime: [clock format [clock scan $date] -format %s]"
puts $swish_index "Document-Type: XML*\n"
puts $swish_index $data
After bootstrapping like this, we kick off the true incremental indexing:
set swish_index [open "|swish-e -v3 -u \
-S prog -i stdin \
-c ./swish.config \
-f /data/incremental_index/index.swish-e" w]
And for each of the 50 items:
puts $swish_index "Path-Name: \'$id\'"
puts $swish_index "Update-Mode: Index"
puts $swish_index "Content-Length: $content_length"
puts $swish_index "Last-Mtime: [clock format [clock scan $date] -format %s]"
puts $swish_index "Document-Type: XML*\n"
puts $swish_index $data
The search results for the two types of index seem to be identical, so
why would the incremental index files be so much larger?
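To narrow down where the extra space goes, the two index sets can be compared file by file rather than in total. A rough sketch, assuming both sets use the same suffixes as listed in the merge step; the proc names `index_sizes` and `compare_indexes` are hypothetical:

```tcl
# Hypothetical helper: map each file suffix of an index set to its size in bytes.
proc index_sizes {base} {
    set sizes [dict create]
    foreach f [glob -nocomplain -- $base $base.*] {
        set suffix [string range $f [string length $base] end]
        if {$suffix eq ""} { set suffix "(main)" }
        dict set sizes $suffix [file size $f]
    }
    return $sizes
}

# Hypothetical helper: return rows of {suffix sizeA sizeB}, using 0 where
# a file exists in only one of the two sets.
proc compare_indexes {a b} {
    set sa [index_sizes $a]
    set sb [index_sizes $b]
    set rows {}
    foreach s [lsort -unique [concat [dict keys $sa] [dict keys $sb]]] {
        set x [expr {[dict exists $sa $s] ? [dict get $sa $s] : 0}]
        set y [expr {[dict exists $sb $s] ? [dict get $sb $s] : 0}]
        lappend rows [list $s $x $y]
    }
    return $rows
}
```

Calling `compare_indexes /data/merge_index/index.swish-e /data/incremental_index/index.swish-e` and printing the rows would show whether the growth sits in one file (for example .prop or .wdata) or is spread across all of them.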
Received on Thu Nov 29 03:15:54 2007