On Wed, Apr 06, 2011 at 12:05:57AM +0200, Jan Høydahl said:
> Just curious, was there any resolution to this?
Not really. We tuned the GC pretty aggressively. We use these options:

    -server -Xmx20G -Xms20G -Xss10M -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing -XX:SoftRefLRUPolicyMSPerMB=10

and we've played a little with CompressedOops and AggressiveOpts. We also backported the MMapDirectory factory to 1.4.1 and that helped a lot.

We do still get spikes of long queries (5s-20s) a few times an hour which don't appear to be caused by any kind of "Query of Death". Occasionally (once every few days) one of the slaves will experience a period of sustained slowness, but it recovers by itself in less than a minute. According to our GC logs we haven't had a full GC for a long time.

Currently the state of play is that we commit on our master every 5000ms and the slaves replicate from it every 2 minutes. Our response times for searches on the slaves are about 180-270ms, but if we turn off replication we get 60-90ms. So something is clearly "up" with that.

Having talked to the good people at Lucid, we're going to try playing around with commit intervals, upping our mergeFactor from 10 to 25 and maybe using the BalancedSegmentMergePolicy.

The system seems to be stable at the moment, which is good, but obviously we'd like to lower our query times if possible.

Hopefully this might be of some use to somebody out there, sometime.

Simon
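
P.S. For anyone reading this later, here is a rough back-of-the-envelope sketch of how the intervals above interact. It is only a simplified model and rests on assumptions: every commit is taken to flush exactly one new segment, and merging is taken to be a plain log-style policy that merges once mergeFactor segments of the same level exist. The function names are made up for illustration and are not Solr or Lucene APIs.

```python
def commits_per_poll(commit_interval_ms, poll_interval_ms):
    """Number of master commits that pile up between two slave polls."""
    return poll_interval_ms // commit_interval_ms


def seconds_between_first_level_merges(commit_interval_ms, merge_factor):
    """Assumed model: one new segment per commit, and a merge is triggered
    once merge_factor same-level segments have accumulated."""
    return commit_interval_ms * merge_factor / 1000


if __name__ == "__main__":
    commit_ms = 5000        # commit on the master every 5000ms
    poll_ms = 2 * 60 * 1000  # slaves replicate every 2 minutes

    print("commits per replication cycle:", commits_per_poll(commit_ms, poll_ms))
    for merge_factor in (10, 25):
        gap = seconds_between_first_level_merges(commit_ms, merge_factor)
        print("mergeFactor", merge_factor, "-> first-level merge about every", gap, "s")
```

Under those assumptions, a higher mergeFactor means merges happen less often on the master, at the cost of more segments in the index at any one time. Whether that actually helps the slaves is something we'd still have to measure.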