Do you see anything in the Solr logs as to what triggered your nodes
changing state?  You should see some kind of error/warning before the
election is triggered.  My gut feeling would be a loss of communication
between your leader and ZK (possibly caused by a GC event that locks the
JVM for a while), but that's pure conjecture since you haven't given much
information.
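
Something like the following is usually enough to spot the trigger (the
log path and the exact message text here are just guesses -- they depend
on your Solr version and logging setup):

  grep -iE "zookeeper|session|expired|recover|leader" logs/solr.log | less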

What is your ZK timeout?  You are seeing ~6s GC events, so if those are
locking the JVM for that long and your ZK timeout is less than that, it is
likely that ZK thinks the node has gone away and forces an election to
find a new leader.  But there should be evidence of that in the logs: you
should see the ZK connection drop.
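
For what it's worth, with the stock solr.xml from the Solr 4.x examples
the timeout is picked up from the zkClientTimeout property (15s unless
you've changed it -- check your own solr.xml), so you can raise it at
startup.  Just a sketch, adjust the hosts and paths for your own setup:

  java -DzkClientTimeout=30000 -DzkHost=zk1:2181,zk2:2181,zk3:2181 -jar start.jar

Raising the timeout only hides the symptom though -- a 6s stop-the-world
pause still hurts queries, so the full GCs are worth fixing regardless.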


On 28 August 2013 08:25, sling <sling...@gmail.com> wrote:

> hi,
> I have a SolrCloud cluster of 8 JVMs, split into 4 shards (2 nodes per
> shard). About 1,000,000 docs are indexed per day, and there are about 10
> query requests per second, sometimes peaking at around 100 per second.
>
> In each shard, one JVM has 8G of RAM and the other has 5G.
>
> The JVM args are like this:
> -Xmx5000m -Xms5000m -Xmn2500m -Xss1m -XX:PermSize=128m -XX:MaxPermSize=128m
> -XX:SurvivorRatio=3 -XX:+UseParNewGC -XX:ParallelGCThreads=4
> -XX:+UseConcMarkSweepGC -XX:CMSFullGCsBeforeCompaction=5
> -XX:+UseCMSCompactAtFullCollection -XX:+PrintGCDateStamps -XX:+PrintGC
> -Xloggc:log/jvmsolr.log
> OR
> -Xmx8000m -Xms8000m -Xmn2500m -Xss1m -XX:PermSize=128m -XX:MaxPermSize=128m
> -XX:SurvivorRatio=3 -XX:+UseParNewGC -XX:ParallelGCThreads=8
> -XX:+UseConcMarkSweepGC -XX:CMSFullGCsBeforeCompaction=5
> -XX:+UseCMSCompactAtFullCollection -XX:+PrintGC -XX:+PrintGCDateStamps
> -Xloggc:log/jvmsolr.log
>
> The nodes work well, but they also switch state every day (and at the
> same time, GC becomes abnormal, as shown below).
>
> 2013-08-28T13:29:39.140+0800: 97180.866: [GC 3770296K->2232626K(4608000K),
> 0.0099250 secs]
> 2013-08-28T13:30:09.324+0800: 97211.050: [GC 3765732K->2241711K(4608000K),
> 0.0124890 secs]
> 2013-08-28T13:30:29.777+0800: 97231.504: [GC 3760694K->2736863K(4608000K),
> 0.0695530 secs]
> 2013-08-28T13:31:02.887+0800: 97264.613: [GC 4258337K->4354810K(4608000K),
> 0.1374600 secs]
> 97264.752: [Full GC 4354810K->2599431K(4608000K), 6.7833960 secs]
> 2013-08-28T13:31:09.884+0800: 97271.610: [GC 2750517K(4608000K), 0.0054320
> secs]
> 2013-08-28T13:31:15.354+0800: 97277.080: [GC 3550474K(4608000K), 0.0871270
> secs]
> 2013-08-28T13:31:31.258+0800: 97292.984: [GC 3877223K(4608000K), 0.1551870
> secs]
> 2013-08-28T13:31:34.396+0800: 97296.123: [GC 3877223K(4608000K), 0.1220380
> secs]
> 2013-08-28T13:31:38.102+0800: 97299.828: [GC 3877225K(4608000K), 0.1545500
> secs]
> 2013-08-28T13:31:40.227+0800: 97303.019: [Full GC
> 4174941K->2127315K(4608000K), 6.3435150 secs]
> 2013-08-28T13:31:49.645+0800: 97311.371: [GC 2508466K(4608000K), 0.0355180
> secs]
> 2013-08-28T13:31:57.645+0800: 97319.371: [GC 2967737K(4608000K), 0.0579650
> secs]
>
> Even worse, sometimes a whole shard goes down (one node is recovering,
> the other is down), which is an absolute disaster...
>
> Please help me.  Any advice is welcome...
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/why-does-a-node-switch-state-tp4086939.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
