Issue when ZooKeeper session expires during shard leader election.
Hey, I am encountering an issue which looks a lot like https://issues.apache.org/jira/browse/SOLR-6763. However, it seems like the fix for that does not address the entire problem. That fix only works if we hit the zkClient.getChildren() call before the reconnect logic has finished reconnecting us to ZooKeeper (I can reproduce scenarios where it doesn't in 4.10.4). If the reconnect has already happened, we won't get the session timeout exception.

The specific problem I am seeing is slightly different from SOLR-6763, but the root cause appears to be the same.

The issue I am seeing is this: during startup the collections are registered, and there is one coreZkRegister-1-thread-* per collection. The elections are started on this thread, the /collections//leader_elect ZNodes are created, and then the thread blocks waiting for the peers to become available. During the block, the ZooKeeper session times out. Once we finish blocking, the reconnect logic calls register() for each collection, which restarts the election process (although serially this time). At a later point, we can have two threads trying to register the same collection. This is incorrect, because the original coreZkRegister-1-thread-* threads assume they are the leader with no verification from ZooKeeper. The ephemeral leader_elect nodes they created were removed when the session timed out. If another host started in the interim (or at any point after that, actually), it would see no leader and would attempt to become leader of the shard itself. This leads to some interesting race conditions, where you can end up with two leaders for a shard.

It seems like a more complete fix would be to actually close the ElectionContext upon reconnect. This would break us out of the wait-for-peers loop and stop the threads from processing the rest of the leadership logic. The reconnection logic would then continue to call register() again for each collection, and if the ZK state indicates it should be leader, it can re-run the leadership logic.

I have a patch in testing that does this, and I think it addresses the problem.

What is the general process for this? I didn't want to reopen a closed Jira item. Should I create a new one so the issue and the proposed fix can be discussed?

Thanks.

Mike.
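P.S. For illustration, the shape of the change I have in mind is roughly the following. This is a simplified sketch, not the actual patch; the class and method names are stand-ins for the real Solr types (ElectionContext, ZkController, the OnReconnect hook), so treat everything here as an assumption about how the pieces fit together.

import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch only: simplified stand-ins for the real Solr classes,
// not the actual patch.
public class ReconnectSketch {

    /** Minimal stand-in for an election context that can be cancelled. */
    interface ElectionContext {
        // Cancelling should remove any leftover ephemeral election node and
        // unblock a thread stuck in the "wait for replicas to come up" loop.
        void cancelElection() throws InterruptedException;
    }

    /** One outstanding election per core, keyed by core name. */
    private final Map<String, ElectionContext> electionContexts = new ConcurrentHashMap<>();

    /** Called once the ZooKeeper client has established a new session. */
    public void onReconnect() {
        // 1. Cancel every in-flight election. The ephemeral leader_elect nodes
        //    died with the expired session, so any thread still running the old
        //    leadership logic is acting on stale state.
        for (ElectionContext ctx : electionContexts.values()) {
            try {
                ctx.cancelElection();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
        electionContexts.clear();

        // 2. Re-register each core. register() re-runs the election from scratch
        //    against the current ZK state, so a core only continues with the
        //    leadership logic if it actually wins this time around.
        for (String coreName : coresToRegister()) {
            register(coreName);
        }
    }

    private Iterable<String> coresToRegister() { return Collections.emptyList(); } // placeholder
    private void register(String coreName) { /* placeholder for ZkController.register(...) */ }
}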
Sync failure after shard leader election when adding new replica.
Hi,

I have a SolrCloud setup, running 4.10.3. The setup consists of several cores, each with a single shard, and initially each shard has a single replica (so, basically, one machine). I am using core discovery, and my deployment tools create an empty core on newly provisioned machines.

The scenario that I am testing is: Machine 1 is running and writes are occurring from my application to Solr. At some point, I stop Machine 1 and reconfigure my application to add Machine 2. Both machines are then started.

What I would expect to happen at this point is that Machine 2 cannot become leader, because it is behind compared to Machine 1. Machine 2 would then restore from Machine 1. However, looking at the logs, I am seeing Machine 2 become elected leader and fail the PeerSync:

2015-05-24 17:20:25.983 -0700 (,,,) coreZkRegister-1-thread-4 : INFO org.apache.solr.cloud.ShardLeaderElectionContext - Enough replicas found to continue.
2015-05-24 17:20:25.983 -0700 (,,,) coreZkRegister-1-thread-4 : INFO org.apache.solr.cloud.ShardLeaderElectionContext - I may be the new leader - try and sync
2015-05-24 17:20:25.997 -0700 (,,,) coreZkRegister-1-thread-4 : INFO org.apache.solr.update.PeerSync - PeerSync: core=project url=http://10.32.132.64:11000/solr START replicas=[http://jchar-1:11000/solr/project/] nUpdates=100
2015-05-24 17:20:25.999 -0700 (,,,) coreZkRegister-1-thread-4 : INFO org.apache.solr.update.PeerSync - PeerSync: core=project url=http://10.32.132.64:11000/solr DONE. We have no versions. sync failed.
2015-05-24 17:20:25.999 -0700 (,,,) coreZkRegister-1-thread-4 : INFO org.apache.solr.cloud.ShardLeaderElectionContext - We failed sync, but we have no versions - we can't sync in that case - we were active before, so become leader anyway
2015-05-24 17:20:25.999 -0700 (,,,) coreZkRegister-1-thread-4 : INFO org.apache.solr.cloud.ShardLeaderElectionContext - I am the new leader: http://10.32.132.64:11000/solr/project/ shard1

What is the expected behavior here? What's the best practice for adding a new replica? Should I have the SolrCloud running and do it via the Collections API, or can I continue to use core discovery?

Thanks.
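P.S. If the Collections API is the right route, I assume the call would look something like the sketch below. The host, port, collection and shard names are placeholders from my setup, and I believe the ADDREPLICA action only exists from 4.8 onward, so please correct me if I have that wrong.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch of the Collections API call I think is being suggested.
// Host, port, collection and shard are placeholders for my setup.
public class AddReplicaSketch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://10.32.132.64:11000/solr/admin/collections"
                + "?action=ADDREPLICA&collection=project&shard=shard1&wt=json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // response should name the core created for the new replica
            }
        } finally {
            conn.disconnect();
        }
    }
}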
Peer Sync fails when newly added node is elected leader.
Hi,

I am seeing some unexpected behavior when adding a new machine to my cluster. I am running 4.10.3.

My setup has multiple collections, and each collection has a single shard. I am using core auto discovery on the hosts (my deployment mechanism ensures that the directory structure is created and the core.properties file is in the right place).

To add a new machine I have to stop the cluster.

If I add a new machine and start the cluster, and this new machine is elected leader for the shard, peer recovery fails. So now I have a leader with no content and replicas with content. Depending on where the read request is sent, I may or may not get the response I am expecting.

2015-06-04 14:26:09.595 -0700 (,,,) coreZkRegister-1-thread-3 : INFO org.apache.solr.cloud.ShardLeaderElectionContext - Running the leader process for shard shard1
2015-06-04 14:26:09.607 -0700 (,,,) coreZkRegister-1-thread-9 : INFO org.apache.solr.cloud.ShardLeaderElectionContext - Waiting until we see more replicas up for shard shard1: total=2 found=1 timeoutin=1.14707356E15ms
2015-06-04 14:26:10.108 -0700 (,,,) coreZkRegister-1-thread-3 : INFO org.apache.solr.cloud.ShardLeaderElectionContext - Enough replicas found to continue.
2015-06-04 14:26:10.108 -0700 (,,,) coreZkRegister-1-thread-3 : INFO org.apache.solr.cloud.ShardLeaderElectionContext - I may be the new leader - try and sync
2015-06-04 14:26:10.115 -0700 (,,,) coreZkRegister-1-thread-3 : INFO org.apache.solr.update.PeerSync - PeerSync: core=domain url=http://10.36.9.70:11000/solr START replicas=[http://mlim:11000/solr/domain/] nUpdates=100
2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO org.apache.solr.update.PeerSync - PeerSync: core=domain url=http://10.36.9.70:11000/solr DONE. We have no versions. sync failed.
2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO org.apache.solr.cloud.ShardLeaderElectionContext - We failed sync, but we have no versions - we can't sync in that case - we were active before, so become leader anyway
2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO org.apache.solr.cloud.ShardLeaderElectionContext - I am the new leader: http://10.36.9.70:11000/solr/domain/ shard1
2015-06-04 14:26:11.153 -0700 (,,,) coreZkRegister-1-thread-3 : INFO org.apache.solr.cloud.ZkController - No LogReplay needed for core=domain baseURL=http://10.36.9.70:11000/solr
2015-06-04 14:26:11.153 -0700 (,,,) coreZkRegister-1-thread-3 : INFO org.apache.solr.cloud.ZkController - I am the leader, no recovery necessary

This seems like a fairly common scenario, so I suspect either I am doing something incorrectly, or I have an incorrect assumption about how this is supposed to work.

Does anyone have any suggestions?

Thanks

Mike.
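P.S. My reading of what the election is doing in those log lines is roughly the following. This is a simplified sketch of the decision as I understand it, not the actual ShardLeaderElectionContext code, so the method names and the exact conditions are assumptions on my part.

// My reading of the decision the logs above seem to show, as a sketch.
public class LeaderSyncDecisionSketch {

    static boolean becomeLeader(boolean peerSyncSucceeded,
                                boolean haveAnyVersions,
                                boolean wasActiveBefore) {
        if (peerSyncSucceeded) {
            return true;  // in sync with the other replicas, safe to lead
        }
        // "We failed sync, but we have no versions - we can't sync in that case -
        //  we were active before, so become leader anyway"
        if (!haveAnyVersions && wasActiveBefore) {
            return true;  // an empty core still wins the election
        }
        return false;     // otherwise, bail out and let another replica try
    }

    public static void main(String[] args) {
        // The newly added, empty node: PeerSync reports "We have no versions.
        // sync failed." and yet it becomes the leader.
        System.out.println(becomeLeader(false, false, true)); // prints: true
    }
}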
Re: Peer Sync fails when newly added node is elected leader.
Thanks, that was the response I was expecting, unfortunately. We have to stop the cluster to add a node, because Solr is part of a larger system and we don't support either partial shutdown or dynamic addition within the larger system.

"it waits for some time to see other nodes but if it finds none then it goes ahead and becomes the leader."

That is not what I am seeing happen, though. In my example, I had two machines: A (which had been running previously) and B (which was newly added). Both A & B participated in the election, and B was elected. It wasn't a case of only B being available (one way to confirm which replica ended up as leader is sketched below the quoted message). It would seem that B shouldn't be elected when there was a better candidate (A), or that, if elected, B should ensure it's caught up to its peers before marking itself as active.

On 6/4/15, 8:31 PM, "Shalin Shekhar Mangar" wrote:

>Why do you stop the cluster while adding a node? This is the reason why
>this is happening. When the first node of a solr cluster starts up, it
>waits for some time to see other nodes but if it finds none then it goes
>ahead and becomes the leader. If other nodes were up and running then peer
>sync and replication recovery will make sure that the node with data
>becomes the leader. So just keep the cluster running while adding a new
>node.
>
>Also, stop relying on core discovery for setting up a node. At some point
>we will stop supporting this feature. Use the collection API to add new
>replicas.
>
>On Fri, Jun 5, 2015 at 5:01 AM, Michael Roberts
>wrote:
>
>> Hi,
>>
>> I am seeing some unexpected behavior when adding a new machine to my
>> cluster. I am running 4.10.3.
>>
>> My setup has multiple collections, each collection has a single shard. I
>> am using core auto discovery on the hosts (my deployment mechanism ensures
>> that the directory structure is created and the core.properties file is in
>> the right place).
>>
>> To add a new machine I have to stop the cluster.
>>
>> If I add a new machine, and start the cluster, if this new machine is
>> elected leader for the shard, peer recovery fails. So, now I have a leader
>> with no content, and replicas with content. Depending on where the read
>> request is sent, I may or may not get the response I am expecting.
>>
>> 2015-06-04 14:26:09.595 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>> org.apache.solr.cloud.ShardLeaderElectionContext - Running the leader
>> process for shard shard1
>> 2015-06-04 14:26:09.607 -0700 (,,,) coreZkRegister-1-thread-9 : INFO
>> org.apache.solr.cloud.ShardLeaderElectionContext - Waiting until we see
>> more replicas up for shard shard1: total=2 found=1 timeoutin=1.14707356E15ms
>> 2015-06-04 14:26:10.108 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>> org.apache.solr.cloud.ShardLeaderElectionContext - Enough replicas found to
>> continue.
>> 2015-06-04 14:26:10.108 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>> org.apache.solr.cloud.ShardLeaderElectionContext - I may be the new leader
>> - try and sync
>> 2015-06-04 14:26:10.115 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>> org.apache.solr.update.PeerSync - PeerSync: core=domain url=
>> http://10.36.9.70:11000/solr START replicas=[
>> http://mlim:11000/solr/domain/] nUpdates=100
>> 2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>> org.apache.solr.update.PeerSync - PeerSync: core=domain url=
>> http://10.36.9.70:11000/solr DONE. We have no versions. sync failed.
>> 2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>> org.apache.solr.cloud.ShardLeaderElectionContext - We failed sync, but we
>> have no versions - we can't sync in that case - we were active before, so
>> become leader anyway
>> 2015-06-04 14:26:10.121 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>> org.apache.solr.cloud.ShardLeaderElectionContext - I am the new leader:
>> http://10.36.9.70:11000/solr/domain/ shard1
>> 2015-06-04 14:26:11.153 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>> org.apache.solr.cloud.ZkController - No LogReplay needed for core=domain
>> baseURL=http://10.36.9.70:11000/solr
>> 2015-06-04 14:26:11.153 -0700 (,,,) coreZkRegister-1-thread-3 : INFO
>> org.apache.solr.cloud.ZkController - I am the leader, no recovery necessary
>>
>> This seems like a fairly common scenario. So I suspect, either I am doing
>> something incorrectly, or I have an incorrect assumption about how this is
>> supposed to work.
>>
>> Does anyone have any suggestions?
>>
>> Thanks
>>
>> Mike.
>>
>
>
>
>--
>Regards,
>Shalin Shekhar Mangar.
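For completeness, the check mentioned above: query the Collections API CLUSTERSTATUS action and look at the leader flag on each replica. The host, port and collection name below are placeholders from my setup, and I believe CLUSTERSTATUS is only available from 4.8 onward, so treat this as a sketch rather than the definitive way to do it.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Dumps the CLUSTERSTATUS response so the "leader" flag on each replica
// can be inspected after a restart. Host/port/collection are placeholders.
public class ClusterStatusCheck {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://10.36.9.70:11000/solr/admin/collections"
                + "?action=CLUSTERSTATUS&collection=domain&wt=json");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // look for "leader":"true" on one replica per shard
            }
        } finally {
            conn.disconnect();
        }
    }
}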
SolrCloud timing out marking node as down during startup.
Hi,

I'm seeing some odd behavior that I am hoping someone could explain to me. The configuration I'm using to repro the issue has a ZK cluster and a single Solr instance. The instance has 10 cores, and none of the cores are sharded.

The initial startup is fine; the Solr instance comes up and we build our index. However, if the Solr instance exits uncleanly (killed rather than sent a SIGINT), the next time it starts I see the following in the logs.

2015-01-22 09:56:23.236 -0800 (,,,) localhost-startStop-1 : INFO org.apache.solr.common.cloud.ZkStateReader - Updating cluster state from ZooKeeper...
2015-01-22 09:56:30.008 -0800 (,,,) localhost-startStop-1-EventThread : DEBUG org.apache.solr.common.cloud.SolrZkClient - Submitting job to respond to event WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/live_nodes
2015-01-22 09:56:30.008 -0800 (,,,) zkCallback-2-thread-1 : DEBUG org.apache.solr.common.cloud.ZkStateReader - Updating live nodes... (0)
2015-01-22 09:57:24.102 -0800 (,,,) localhost-startStop-1 : WARN org.apache.solr.cloud.ZkController - Timed out waiting to see all nodes published as DOWN in our cluster state.
2015-01-22 09:57:24.102 -0800 (,,,) localhost-startStop-1 : INFO org.apache.solr.cloud.ZkController - Register node as live in ZooKeeper:/live_nodes/10.18.8.113:11000_solr

My question is about "Timed out waiting to see all nodes published as DOWN in our cluster state." From a cursory look at the code, we seem to iterate through all Collections/Shards and mark the state as Down. These notifications are offered to the Overseer, who I believe updates the ZK state. We then wait for the ZK state to update, with a 60-second timeout. However, it looks like the Overseer is not started until after we wait for the timeout. So, in a single-instance scenario, we'll always have to wait out the timeout.

Is this the expected behavior (and just a side effect of running a single instance in cloud mode), or is my understanding of the Overseer/ZK relationship incorrect?

Thanks.

Mike.
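P.S. To make my reading of the ordering concrete, the sequence I think I'm seeing is sketched below. This is heavily simplified pseudocode of my understanding of ZkController startup, not the actual code, and the step names are my own, so please correct me if the ordering is wrong.

// Heavily simplified sketch of the startup ordering in question. If this
// reading is right, a single-instance cluster can never see its own "down"
// publishes applied in time, because no Overseer is processing the queue yet.
public class StartupOrderSketch {

    public void startCloudMode() throws InterruptedException {
        publishAllCoresAsDown();       // 1. queue "state=down" for every core
        waitToSeeDownStates(60_000L);  // 2. wait (up to 60s) for the cluster state to reflect it
        joinOverseerElection();        // 3. only now can this node become the Overseer...
        registerLiveNode();            // 4. ...and register itself under /live_nodes
    }

    private void publishAllCoresAsDown() {
        // Offers state updates to the Overseer queue; they are not applied to
        // clusterstate.json until some node's Overseer processes them.
    }

    private void waitToSeeDownStates(long timeoutMs) throws InterruptedException {
        // Stand-in for polling the cluster state: with a single instance there
        // is no running Overseer yet, so this always waits out the full timeout.
        Thread.sleep(timeoutMs);
    }

    private void joinOverseerElection() { /* placeholder */ }
    private void registerLiveNode()     { /* placeholder */ }
}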