On Nov 6, 2012 at 6:06 AM, Arend-Jan Wijtzes <ajwyt...@wise-guys.nl> wrote:

... During the uninvert phase of this text field the searchers experience long stalls because of garbage collection (pauses of 20+ seconds), which causes Solr to lose the ZooKeeper lease. Often they do not recover gracefully, and as a result the cluster becomes degraded:
"SEVERE: There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props"

This is a known open issue.

<warning: commercial product mention follows>

Using the Zing JVM is a simple, immediate way to get around this and other known GC-related issues. Zing eliminates GC pauses as a concern for enterprise applications such as this, driving worst-case JVM-related hiccups down to the millisecond level. This behavior tends to happen out of the box, with little or no tuning, and at any heap size your server can support. For example, on the specific server configuration you mention (24 vcores, 48GB of RAM) you should be able to comfortably run with a -Xmx of 30GB and no longer worry about pauses. We've had people run much larger than that (e.g. http://blog.mikemccandless.com/2012/07/lucene-index-in-ram-with-azuls-zing-jvm.html).

In full disclosure, I work for (and am the CTO at) Azul.

-- Gil.