Do you see anything in the Solr logs indicating what triggered your nodes' change of state? You should see some kind of error/warning before the election is triggered. My gut feeling would be a loss of communication between your leader and ZK (possibly caused by a GC event that locks up the JVM for a while), but that's pure conjecture given you haven't provided much information.
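If it helps, something like the following against the Solr log is usually enough to spot the trigger; treat the patterns as rough guesses, since the exact wording of the ZK session/connection messages varies between Solr and ZooKeeper versions:

    # look for session expiry / disconnects / leader election around the GC timestamps
    grep -iE "zookeeper|session (expired|timed out)|disconnected|leaderelect" solr.log

If the session drops at the same timestamps as those long full GCs, that would confirm the theory.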
What is your ZK timeout? You are seeing a 6s GC event, so if that is locking the JVM for that long and your ZK timeout is less than that, it is likely that ZK thinks that node has gone away, so it forces an election to find a new leader. But there should be evidence of that in the logs: you should see the ZK connection drop. I've put an example of where that timeout is set below your quoted message.

On 28 August 2013 08:25, sling <sling...@gmail.com> wrote:

> hi,
> I have a solrcloud with 8 jvm, which has 4 shards(2 nodes for each shard).
> 1000 000 docs are indexed per day, and 10 query requests per second, and
> sometimes, maybe there are 100 query requests per second.
>
> in each shard, one jvm has 8G ram, and another has 5G.
>
> the jvm args is like this:
> -Xmx5000m -Xms5000m -Xmn2500m -Xss1m -XX:PermSize=128m -XX:MaxPermSize=128m
> -XX:SurvivorRatio=3 -XX:+UseParNewGC -XX:ParallelGCThreads=4
> -XX:+UseConcMarkSweepGC -XX:CMSFullGCsBeforeCompaction=5
> -XX:+UseCMSCompactAtFullCollection -XX:+PrintGCDateStamps -XX:+PrintGC
> -Xloggc:log/jvmsolr.log
> OR
> -Xmx8000m -Xms8000m -Xmn2500m -Xss1m -XX:PermSize=128m -XX:MaxPermSize=128m
> -XX:SurvivorRatio=3 -XX:+UseParNewGC -XX:ParallelGCThreads=8
> -XX:+UseConcMarkSweepGC -XX:CMSFullGCsBeforeCompaction=5
> -XX:+UseCMSCompactAtFullCollection -XX:+PrintGC -XX:+PrintGCDateStamps
> -Xloggc:log/jvmsolr.log
>
> Nodes works well, but also switch state every day (at the same time, gc
> becomes abnormal like below).
>
> 2013-08-28T13:29:39.140+0800: 97180.866: [GC 3770296K->2232626K(4608000K),
> 0.0099250 secs]
> 2013-08-28T13:30:09.324+0800: 97211.050: [GC 3765732K->2241711K(4608000K),
> 0.0124890 secs]
> 2013-08-28T13:30:29.777+0800: 97231.504: [GC 3760694K->2736863K(4608000K),
> 0.0695530 secs]
> 2013-08-28T13:31:02.887+0800: 97264.613: [GC 4258337K->4354810K(4608000K),
> 0.1374600 secs]
> 97264.752: [Full GC 4354810K->2599431K(4608000K), 6.7833960 secs]
> 2013-08-28T13:31:09.884+0800: 97271.610: [GC 2750517K(4608000K), 0.0054320
> secs]
> 2013-08-28T13:31:15.354+0800: 97277.080: [GC 3550474K(4608000K), 0.0871270
> secs]
> 2013-08-28T13:31:31.258+0800: 97292.984: [GC 3877223K(4608000K), 0.1551870
> secs]
> 2013-08-28T13:31:34.396+0800: 97296.123: [GC 3877223K(4608000K), 0.1220380
> secs]
> 2013-08-28T13:31:38.102+0800: 97299.828: [GC 3877225K(4608000K), 0.1545500
> secs]
> 2013-08-28T13:31:40.227+0800: 97303.019: [Full GC
> 4174941K->2127315K(4608000K), 6.3435150 secs]
> 2013-08-28T13:31:49.645+0800: 97311.371: [GC 2508466K(4608000K), 0.0355180
> secs]
> 2013-08-28T13:31:57.645+0800: 97319.371: [GC 2967737K(4608000K), 0.0579650
> secs]
>
> even more, sometimes a shard is down(one node is recovering, another is
> down), that is an absolute disaster...
>
> please help me. any advice is welcome...
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/why-does-a-node-switch-state-tp4086939.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
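For reference, and assuming a 4.x-style solr.xml (adjust for whatever version you are actually running), the ZK session timeout is the zkClientTimeout attribute on the <cores> element, which you can also override with a system property at startup. Roughly:

    <!-- ZK session timeout in ms; the stock example configs default to 15s -->
    <cores adminPath="/admin/cores"
           host="${host:}" hostPort="${jetty.port:8983}"
           zkClientTimeout="${zkClientTimeout:15000}">

    # override at startup, e.g. raise it to 30s
    java -DzkClientTimeout=30000 -jar start.jar

Raising it comfortably above your worst-case full GC pause should stop the spurious elections, but the node is still unresponsive for those 6 seconds, so the GC behaviour is worth investigating in its own right.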