Hi all, we are using SolrCloud with this configuration:
* Solr 4.4.0
* ZooKeeper 3.4.5
* one server with ZooKeeper + 4 Solr nodes
* one server with 4 Solr nodes
* only one core
* Solr instances deployed on Tomcat with mod_cluster
* clients access with SolrJ through Apache + mod_cluster

In the morning, we have massive updates (several thousand in a few minutes) with explicit softCommit=true. These updates are load balanced across all nodes, regardless of whether a node is the leader or not.

When this happens, the Solr cloud admin console shows 7 nodes as recovering and the leader as active. We also noticed that refreshing the cloud graph takes a very long time. This situation can last 3 hours until the clusterstate refreshes. During this phase, ZooKeeper is garbage collecting heavily (I can post the Munin GC graphs).

Here are the command line parameters of the ZooKeeper and Solr nodes (I have replaced some values with XXX for confidentiality reasons).

ZooKeeper:

java -cp /var/lib/zookeeper/bin/../build/classes:/var/lib/zookeeper/bin/../build/lib/*.jar:/var/lib/zookeeper/bin/../lib/slf4j-log4j12-1.6.1.jar:/var/lib/zookeeper/bin/../lib/slf4j-api-1.6.1.jar:/var/lib/zookeeper/bin/../lib/netty-3.2.2.Final.jar:/var/lib/zookeeper/bin/../lib/log4j-1.2.15.jar:/var/lib/zookeeper/bin/../lib/jline-0.9.94.jar:/var/lib/zookeeper/bin/../zookeeper-3.4.5.jar:/var/lib/zookeeper/bin/../src/java/lib/*.jar:/app/zookeeper/conf: -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.port=XXX -Xms384m -Xmx384m -XX:MaxPermSize=128m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false org.apache.zookeeper.server.quorum.QuorumPeerMain /app/zookeeper/conf/zoo.cfg

Solr:

/usr/lib/jvm/java/bin/java -Dsolr.data.dir=/app/solr/server/search_01/vod/data -Dsolr.solr.home=/app/solr/server/search_01 -DnumShards=1 -Dbootstrap_confdir=/app/solr/server/search_01/vod/conf -Dcollection.configName=vod -DzkHost=XXX:2181 -Dtomcat.server.port=XXX -Dtomcat.http.port=XXX -Dtomcat.ajp.port=XXX -Dlog4j.configuration=file:///app/tomcat/server/search_01/conf/log4j.properties -Djboss.jvmRoute=SEARCH_02_01 -Djboss.modcluster.sendToApacheDelayInSec=10 -Djboss.modcluster.nodetimeout=30 -Djboss.modcluster.ttl=10 -Xms2048m -Xmx2048m -XX:MaxPermSize=384m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=XXX -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -classpath :/app/tomcat/server/search_01/bin/bootstrap.jar:/app/tomcat/server/search_01/bin/tomcat-juli.jar:/usr/share/java/commons-daemon.jar -Dcatalina.base=/app/tomcat/server/search_01 -Dcatalina.home=/app/tomcat/server/search_01 -Djava.endorsed.dirs= -Djava.io.tmpdir=/app/tomcat/server/search_01/temp -Djava.util.logging.config.file=/app/tomcat/server/search_01/conf/log4j.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager org.apache.catalina.startup.Bootstrap start

I have tried other GC strategies, max heap sizes, NewRatio values, etc. on ZooKeeper, without success. Whenever ZooKeeper is garbage collecting, the clusterstate is incorrect.

Is this a bug in ZooKeeper or Solr 4.4.0, or is it due to some misconfiguration? I have read somewhere that there is a timeout value between Solr and ZooKeeper, but I don't know where it is set or what its default value is.

Any help will be appreciated.

Regards,
Metin