On 3-Feb-08, at 1:34 PM, Stu Hood wrote:
I just finished watching this talk about a column-store RDBMS,
which has a long section on column compression. Specifically, it
talks about the gains from compressing similar data together, and
how lazily decompressing data only when it must be processed is
great for memory/CPU cache usage.
http://youtube.com/watch?v=yrLd-3lnZ58
While interesting, it's not relevant to Lucene's stored field
storage. On the other hand, it did get me thinking about stored
field compression and lazy field loading.
Can anyone give me some pointers about compressThreshold values
that would be worth experimenting with? Our stored fields are often
between 20 and 300 characters, and we're willing to spend more time
indexing if it will make searching less I/O-bound.
Field compression saves space and converts the field into a
binary field, which is lazy-loaded more efficiently than a string
field. As for the threshold, I use 200 on a multi-kilobyte field,
but that doesn't mean it isn't effective on smaller fields.
Experimenting on small indices and then calculating the average
stored bytes per doc is usually fruitful.
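For reference, roughly what that looks like in config (just a sketch
assuming Solr 1.x-era syntax; the field name and type here are made
up): compression is switched on per field in schema.xml, and lazy
loading is enabled index-wide in solrconfig.xml.

  <!-- schema.xml: compress the stored value once it exceeds 200 characters -->
  <field name="body" type="text" indexed="true" stored="true"
         compressed="true" compressThreshold="200"/>

  <!-- solrconfig.xml, inside the <query> section: load stored fields on demand -->
  <enableLazyFieldLoading>true</enableLazyFieldLoading>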
Of course, the best way to improve performance in this regard is to
store the less-frequently-used fields in a parallel Solr index. That
only works if the largest fields are the rarely-used ones, though
(e.g., the full document contents, retrieved only to build a summary).
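To make that layout concrete, here's a sketch with hypothetical field
names: the main index keeps the small fields you return on every hit,
and a second index shares the same unique key so the big field is only
fetched when you actually build the summary.

  <!-- main index schema.xml: small, frequently returned fields -->
  <field name="id"    type="string" indexed="true" stored="true"/>
  <field name="title" type="text"   indexed="true" stored="true"/>

  <!-- parallel index schema.xml: same id, plus the large, rarely fetched field -->
  <field name="id"       type="string" indexed="true" stored="true"/>
  <field name="contents" type="text"   indexed="false" stored="true"
         compressed="true" compressThreshold="200"/>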
-Mike