ZooKeeper does not update cluster state while garbage collecting

OSMAN Metin Mon, 10 Mar 2014 03:37:14 -0700

Hi all,

We are using SolrCloud with the following configuration:

* Solr 4.4.0
* ZooKeeper 3.4.5
* one server with ZooKeeper + 4 Solr nodes
* one server with 4 Solr nodes
* only one core
* Solr instances deployed on Tomcat with mod_cluster
* clients access Solr with SolrJ through Apache + mod_cluster

In the morning, we receive massive updates (several thousand in a few minutes) 
with explicit softCommit=true.
These updates are load-balanced across all nodes, regardless of whether a node 
is the leader or not.
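
To illustrate the kind of request involved, here is a rough sketch of a single update sent over plain HTTP with softCommit=true. This is not our actual client code (we go through SolrJ). The host, port and context path are placeholders, "vod" is presumably the core name given our startup parameters, and the "id" field is an assumption about the schema.

```shell
# Hypothetical base URL: host, port and context path are placeholders.
# "vod" is assumed to be the core name (taken from our startup parameters).
SOLR_URL="http://localhost:8080/solr/vod"
update_url="${SOLR_URL}/update?softCommit=true"

# Define (but do not run) a helper that posts one small document.
# The "id" field is an assumption about the schema.
post_update() {
  curl -s "$update_url" \
    -H 'Content-Type: text/xml' \
    --data-binary '<add><doc><field name="id">example-1</field></doc></add>'
}

# Show the URL every such update would hit.
echo "$update_url"
```

Each of the several thousand morning updates carries the softCommit parameter in this way, so every one of them requests a soft commit.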

When this happens, the SolrCloud admin console shows 7 nodes as recovering and 
the leader as active.
We also noticed that refreshing the graph view takes a very long time.
This situation can last up to 3 hours, until the clusterstate refreshes.
During this phase, ZooKeeper is garbage collecting heavily (I can post the 
Munin GC graphs).

Here are the command line parameters of the ZooKeeper and Solr nodes (I have 
replaced some values with XXX for confidentiality reasons).

ZooKeeper:

java -cp 
/var/lib/zookeeper/bin/../build/classes:/var/lib/zookeeper/bin/../build/lib/*.jar:/var/lib/zookeeper/bin/../lib/slf4j-log4j12-1.6.1.jar:/var/lib/zookeeper/bin/../lib/slf4j-api-1.6.1.jar:/var/lib/zookeeper/bin/../lib/netty-3.2.2.Final.jar:/var/lib/zookeeper/bin/../lib/log4j-1.2.15.jar:/var/lib/zookeeper/bin/../lib/jline-0.9.94.jar:/var/lib/zookeeper/bin/../zookeeper-3.4.5.jar:/var/lib/zookeeper/bin/../src/java/lib/*.jar:/app/zookeeper/conf:
 -Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.port=XXX -Xms384m -Xmx384m -XX:MaxPermSize=128m 
-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false 
org.apache.zookeeper.server.quorum.QuorumPeerMain /app/zookeeper/conf/zoo.cfg

Solr:

/usr/lib/jvm/java/bin/java -Dsolr.data.dir=/app/solr/server/search_01/vod/data 
-Dsolr.solr.home=/app/solr/server/search_01 -DnumShards=1 
-Dbootstrap_confdir=/app/solr/server/search_01/vod/conf 
-Dcollection.configName=vod -DzkHost=XXX:2181 -Dtomcat.server.port=XXX 
-Dtomcat.http.port=XXX -Dtomcat.ajp.port=XXX 
-Dlog4j.configuration=file:///app/tomcat/server/search_01/conf/log4j.properties 
-Djboss.jvmRoute=SEARCH_02_01 -Djboss.modcluster.sendToApacheDelayInSec=10 
-Djboss.modcluster.nodetimeout=30 -Djboss.modcluster.ttl=10 -Xms2048m -Xmx2048m 
-XX:MaxPermSize=384m -Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.port=XXX 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.authenticate=false -classpath 
:/app/tomcat/server/search_01/bin/bootstrap.jar:/app/tomcat/server/search_01/bin/tomcat-juli.jar:/usr/share/java/commons-daemon.jar
 -Dcatalina.base=/app/tomcat/server/search_01 
-Dcatalina.home=/app/tomcat/server/search_01 -Djava.endorsed.dirs= 
-Djava.io.tmpdir=/app/tomcat/server/search_01/temp 
-Djava.util.logging.config.file=/app/tomcat/server/search_01/conf/log4j.properties
 -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager 
org.apache.catalina.startup.Bootstrap start

I have tried other GC strategies, max heap values, new ratios, etc. on 
ZooKeeper, without success.
Every time ZooKeeper is garbage collecting, the clusterstate is not correct.
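
In case it helps to correlate the pauses with the stale cluster state, GC activity on the ZooKeeper JVM could be logged with the standard HotSpot options. The following is only a sketch: the log path is hypothetical, and it assumes ZooKeeper is started through a script that honors the JVMFLAGS variable, as the stock zkServer.sh does. Otherwise the same flags can be appended directly to the java command line shown above.

```shell
# Hypothetical log path; pick a location the ZooKeeper user can write to.
GC_LOG="/var/log/zookeeper/gc.log"

# Standard HotSpot (Java 6/7) GC logging options, including the total time
# application threads were stopped, which is what matters for timeouts.
JVMFLAGS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
JVMFLAGS="$JVMFLAGS -XX:+PrintGCApplicationStoppedTime -Xloggc:$GC_LOG"
export JVMFLAGS

# Print the flags so they can be checked before restarting the server.
echo "$JVMFLAGS"
```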

Is this a bug in ZooKeeper or Solr 4.4.0, or is it due to some misconfiguration?
I have read somewhere that there is a timeout value between Solr and ZooKeeper, 
but I don't know where it is set (or what its default value is).

Any help will be appreciated.

Regards,
Metin
