On 3-Feb-08, at 1:34 PM, Stu Hood wrote:

I just finished watching this talk about a column-store RDBMS, which has a long section on column compression. Specifically, it talks about the gains from compressing similar data together, and how lazily decompressing data only when it must be processed is great for memory/CPU cache usage.

http://youtube.com/watch?v=yrLd-3lnZ58

While interesting, it's not relevant to Lucene's stored field storage. On the other hand, it did get me thinking about stored field compression and lazy field loading.

Can anyone give me some pointers about compressThreshold values that would be worth experimenting with? Our stored fields are often between 20 and 300 characters, and we're willing to spend more time indexing if it will make searching less IO bound.

Field compression can save you space, and it converts the field into a binary field, which is lazy-loaded more efficiently than a string field. As for the threshold, I use 200 on a multi-kilobyte field, but that doesn't mean it wouldn't be effective on smaller fields. Experimenting on small indices and then calculating the average stored bytes per doc is usually fruitful.
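Roughly, at the Lucene level, that combination looks something like the untested sketch below (2.x-era API; the RAMDirectory, field names, and values are just placeholders):

    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldSelector;
    import org.apache.lucene.document.SetBasedFieldSelector;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class CompressedLazyFieldSketch {
        public static void main(String[] args) throws Exception {
            Directory dir = new RAMDirectory();

            // Index a document whose "body" field is stored compressed (binary on disk).
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(new Field("id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("body", "some multi-kilobyte text ...",
                              Field.Store.COMPRESS, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.close();

            // Load "id" eagerly, but defer "body" until it is actually asked for.
            Set eager = Collections.singleton("id");
            Set lazy = new HashSet();
            lazy.add("body");
            FieldSelector selector = new SetBasedFieldSelector(eager, lazy);

            IndexReader reader = IndexReader.open(dir);
            Document stored = reader.document(0, selector);
            String id = stored.get("id");                              // already loaded
            String body = stored.getFieldable("body").stringValue();   // read (and decompressed) on demand
            System.out.println(id + " -> " + body.length() + " chars");
            reader.close();
        }
    }

For the bytes-per-doc measurement, one rough yardstick is the size of the index's .fdt (stored fields data) files divided by the number of docs.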

Of course, the best way to improve performance in this regard is to store the less-frequently-used fields in a parallel Solr index. This only works if the largest fields are also the rarely-used ones, though (e.g. the full doc contents that you only retrieve to create a summary).
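Sketching that split in raw Lucene terms (in Solr you'd do the same with two instances sharing a unique key; the paths, field names, and query below are made up):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class ParallelIndexSketch {
        public static void main(String[] args) throws Exception {
            // "main" holds the small, frequently returned fields; "contents"
            // holds only (id, full document text) for the occasional summary.
            IndexSearcher mainSearcher = new IndexSearcher("/path/to/main-index");
            IndexSearcher contentsSearcher = new IndexSearcher("/path/to/contents-index");

            Query q = new QueryParser("title", new StandardAnalyzer()).parse("lucene compression");
            Hits hits = mainSearcher.search(q);

            for (int i = 0; i < Math.min(10, hits.length()); i++) {
                Document small = hits.doc(i);   // cheap: only small stored fields live here
                String id = small.get("id");

                // Only pay for the big field when a summary is actually needed.
                Hits big = contentsSearcher.search(new TermQuery(new Term("id", id)));
                if (big.length() > 0) {
                    String contents = big.doc(0).get("contents");
                    // ... build the snippet/summary from contents ...
                }
            }
        }
    }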

-Mike
