Hi, I'm seeing some odd behavior that I am hoping someone could explain to me.
The configuration I'm using to repro the issue has a ZK cluster and a single Solr instance. The instance has 10 cores, and none of the cores are sharded. The initial startup is fine: the Solr instance comes up and we build our index. However, if the Solr instance exits uncleanly (killed rather than sent a SIGINT), the next time it starts I see the following in the logs:

2015-01-22 09:56:23.236 -0800 (,,,) localhost-startStop-1 : INFO org.apache.solr.common.cloud.ZkStateReader - Updating cluster state from ZooKeeper...
2015-01-22 09:56:30.008 -0800 (,,,) localhost-startStop-1-EventThread : DEBUG org.apache.solr.common.cloud.SolrZkClient - Submitting job to respond to event WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes
2015-01-22 09:56:30.008 -0800 (,,,) zkCallback-2-thread-1 : DEBUG org.apache.solr.common.cloud.ZkStateReader - Updating live nodes... (0)
2015-01-22 09:57:24.102 -0800 (,,,) localhost-startStop-1 : WARN org.apache.solr.cloud.ZkController - Timed out waiting to see all nodes published as DOWN in our cluster state.
2015-01-22 09:57:24.102 -0800 (,,,) localhost-startStop-1 : INFO org.apache.solr.cloud.ZkController - Register node as live in ZooKeeper:/live_nodes/10.18.8.113:11000_solr

My question is about "Timed out waiting to see all nodes published as DOWN in our cluster state." From a cursory look at the code, we seem to iterate through all collections/shards and mark each state as DOWN. These notifications are offered to the Overseer, which I believe updates the ZK cluster state. We then wait for the ZK state to reflect the change, with a 60-second timeout. However, it looks like the Overseer is not started until after that wait completes, so in a single-instance scenario we'll always have to sit out the full timeout.

Is this the expected behavior (and just a side effect of running a single instance in cloud mode), or is my understanding of the Overseer/ZK relationship incorrect?

Thanks.
.Mike
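P.S. To make sure I'm describing the sequence I mean, here is a tiny standalone Java sketch (not Solr code; the class, queue, and latch names are made up) of the pattern I think I'm seeing: the DOWN-state messages are queued, but nothing consumes the queue until after the waiter has already given up, so with a single node the wait always runs the full timeout.

import java.util.concurrent.*;

public class OverseerWaitSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for the Overseer's state-update queue in ZK.
        BlockingQueue<String> stateUpdateQueue = new LinkedBlockingQueue<>();
        // Stand-in for the piece of cluster state we are waiting to see change.
        CountDownLatch allCoresMarkedDown = new CountDownLatch(1);

        // Step 1: publish DOWN states for every core (just one message here).
        stateUpdateQueue.offer("core1=DOWN");

        // Step 2: wait for the cluster state to reflect the DOWN states.
        // Nothing is consuming the queue yet, so this always times out
        // (3 seconds here instead of Solr's 60 so the demo finishes quickly).
        boolean sawDownStates = allCoresMarkedDown.await(3, TimeUnit.SECONDS);
        System.out.println(sawDownStates
                ? "Saw DOWN states before timeout"
                : "Timed out waiting to see all nodes published as DOWN");

        // Step 3: only now start the "Overseer", which drains the queue and
        // would have flipped the state we were waiting on in step 2.
        ExecutorService overseer = Executors.newSingleThreadExecutor();
        overseer.submit(() -> {
            String msg;
            while ((msg = stateUpdateQueue.poll()) != null) {
                System.out.println("Overseer processed: " + msg);
                allCoresMarkedDown.countDown();
            }
        });
        overseer.shutdown();
        overseer.awaitTermination(5, TimeUnit.SECONDS);
    }
}

If the real startup order is step 2 before step 3, that would explain the 60-second pause I see on every restart after an unclean shutdown.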