Can you file an issue and attach your logs?

You might also try the 4.0 release to see if the problem was fixed after the beta.

- Mark

On 10/14/2012 08:48 AM, Jam Luo wrote:
Yes, I have the same problem.

2012/10/5 Kyryl Bilokurov <kyryl.biloku...@gmail.com>

Hi,

I have a functional/performance test SolrCloud cluster (using Solr
4.0-BETA) with the following setup: 4 servers, each server hosts 1/4th of
the collection (no replicas, so there are only leaders for each shard).
The current ZK client timeout is set to 15 seconds. From time to time I see
Solr's ZK client connection time out:

======
INFO: Client session timed out, have not heard from server in 19105ms for
sessionid 0x3388fcec9490677, closing socket connection and attempting
reconnect
======

The reconnect is triggered, but after the reconnect the shard enters a
bad state, as it cannot get the leader props for an extended period of
time:

======
INFO: Updating cluster state from ZooKeeper...
Oct 3, 2012 4:07:20 AM org.apache.solr.common.cloud.ZkStateReader$2 process
INFO: A cluster state change has occurred - updating...
Oct 3, 2012 4:07:50 AM org.apache.solr.common.SolrException log
SEVERE: There was a problem finding the leader in zk:java.lang.RuntimeException: Could not get leader props
         at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:640)
         at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1031)
         at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:233)
         at org.apache.solr.cloud.ZkController.access$300(ZkController.java:77)
         at org.apache.solr.cloud.ZkController$1.command(ZkController.java:180)
         at org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:101)
         at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:47)
         at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:85)
         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:526)
         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502)
....
...the same message and stack trace repeat every ~30 seconds, until it
changes to
...
Oct 3, 2012 4:20:09 AM org.apache.solr.common.SolrException log
SEVERE: :org.apache.solr.common.SolrException: There was a problem finding the leader in zk
         at org.apache.solr.cloud.ZkController.waitForLeaderToSeeDownState(ZkController.java:1041)
         at org.apache.solr.cloud.ZkController.registerAllCoresAsDown(ZkController.java:233)
         at org.apache.solr.cloud.ZkController.access$300(ZkController.java:77)
         at org.apache.solr.cloud.ZkController$1.command(ZkController.java:180)
         at org.apache.solr.common.cloud.ConnectionManager$1.update(ConnectionManager.java:101)
         at org.apache.solr.common.cloud.DefaultConnectionStrategy.reconnect(DefaultConnectionStrategy.java:47)
         at org.apache.solr.common.cloud.ConnectionManager.process(ConnectionManager.java:85)
         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:526)
         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:502)

Oct 3, 2012 4:20:09 AM org.apache.solr.cloud.ZkController createEphemeralLiveNode
INFO: Register node as live in ZooKeeper:/live_nodes/host.domain:18100_solr
Oct 3, 2012 4:20:09 AM org.apache.solr.common.cloud.SolrZkClient makePath
INFO: makePath: /live_nodes/host.domain:18100_solr
...
... at this point the cluster seems to be OK for some time.
======

This looks somewhat similar to SOLR-3274, as it is also triggered by an
expired ZK connection and results in "No servers hosting shard" search
errors.

For now, I have increased the timeout to 30 seconds, similar to what is
suggested in SOLR-3274, to lower the probability of ZK timeouts. But
shouldn't the cluster heal faster than in 15 minutes? As there is only one
server hosting each shard, it could become the leader instantly.
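
To put a number on the recovery time, the window can be measured straight
from the timestamped log lines. The following is only a minimal sketch in
Python: it assumes java.util.logging-style header lines as in the excerpt
above and an English locale for month names, and the helper names
(parse_stamps, recovery_window) are made up for illustration.

```python
import re
from datetime import datetime

# Matches java.util.logging header lines such as
# "Oct 3, 2012 4:07:50 AM org.apache.solr.common.SolrException log"
STAMP = re.compile(r"^([A-Z][a-z]{2} \d{1,2}, \d{4} \d{1,2}:\d{2}:\d{2} [AP]M) ")


def parse_stamps(lines):
    """Return a datetime for every line that starts with a timestamp."""
    stamps = []
    for line in lines:
        match = STAMP.match(line)
        if match:
            stamps.append(
                datetime.strptime(match.group(1), "%b %d, %Y %I:%M:%S %p")
            )
    return stamps


def recovery_window(lines):
    """Time between the earliest and latest timestamped line."""
    stamps = parse_stamps(lines)
    return max(stamps) - min(stamps)


# Header lines taken from the excerpt above; non-header lines are skipped.
sample = [
    "Oct 3, 2012 4:07:20 AM org.apache.solr.common.cloud.ZkStateReader$2 process",
    "INFO: A cluster state change has occurred - updating...",
    "Oct 3, 2012 4:07:50 AM org.apache.solr.common.SolrException log",
    "Oct 3, 2012 4:20:09 AM org.apache.solr.cloud.ZkController createEphemeralLiveNode",
]

window = recovery_window(sample)
print(window)  # 0:12:49
```

For the lines quoted here, that is just under 13 minutes between the cluster
state update and the node registering as live again.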

Thanks,
Kyryl

