Hey, I am encountering an issue which looks a lot like https://issues.apache.org/jira/browse/SOLR-6763.
However, it seems like the fix for that does not address the entire problem. That fix only works if we hit the zkClient.getChildren() call before the reconnect logic has finished reconnecting us to ZooKeeper (I can reproduce scenarios where it doesn’t in 4.10.4). If the reconnect has already happened, we won’t get the session timeout exception.

The specific problem I am seeing is slightly different from SOLR-6763, but the root cause appears to be the same. During startup the collections are registered, and there is one coreZkRegister-1-thread-* per collection. The elections are started on this thread, the /collections/<name>/leader_elect ZNodes are created, and then the thread blocks waiting for the peers to become available. While it is blocked, the ZooKeeper session times out. Once we finish blocking, the reconnect logic calls register() for each collection, which restarts the election process (although serially this time). At a later point, we can have two threads trying to register the same collection.

This is incorrect, because the coreZkRegister-1-thread-* threads assume they are leader with no verification from ZooKeeper. The ephemeral leader_elect nodes they created were removed when the session timed out. If another host started in the interim (or at any point after that, actually), it would see no leader and would attempt to become leader of the shard itself. This leads to race conditions where you can end up with two leaders for a shard.

It seems like a more complete fix would be to actually close the ElectionContext upon reconnect. This would break us out of the wait-for-peers loop and stop the threads from processing the rest of the leadership logic. The reconnect logic would then go on to call register() again for each collection, and if the ZK state indicates it should be leader, it can re-run the leadership logic. I have a patch in testing that does this, and I think it addresses the problem.
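
The idea can be sketched roughly as follows. This is a simplified, hypothetical illustration, not the actual patch and not Solr's real ElectionContext: the class name, the method names, and the polling interval are all made up, and the real leadership logic is reduced to a boolean result.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for an election context; not Solr's actual class. */
public class ClosableElectionSketch {
    // Set by the reconnect logic; volatile so the waiting thread sees it.
    private volatile boolean closed = false;
    private final CountDownLatch peersReady = new CountDownLatch(1);

    /** Called on reconnect, once the old session (and its ephemeral nodes) is gone. */
    public void close() {
        closed = true;
    }

    /** Signals that the peers the election was waiting for are available. */
    public void peersArrived() {
        peersReady.countDown();
    }

    /**
     * Waits for peers in short slices so that a close() is noticed promptly.
     * Returns true only if this context is still open when the wait ends,
     * whether it ended because peers arrived or because maxWaitMs elapsed.
     */
    public boolean waitForPeers(long maxWaitMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMs);
        try {
            while (!closed && System.nanoTime() < deadline) {
                if (peersReady.await(50, TimeUnit.MILLISECONDS)) {
                    break;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !closed;
    }

    /** Runs the (stubbed) leadership logic; bails out if the context was closed. */
    public boolean runLeaderProcess(long maxWaitMs) {
        if (!waitForPeers(maxWaitMs)) {
            return false; // stale context: leave it to the new register() call
        }
        // The real leadership steps would go here, ideally re-checking
        // the closed flag before each step that publishes state.
        return !closed;
    }
}
```

With something like this, the reconnect path would call close() on the old context before calling register() again, so the old coreZkRegister thread returns false from runLeaderProcess() instead of continuing as leader.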
What is the general process for this? I didn’t want to reopen a closed Jira item. Should I create a new one so the issue and the proposed fix can be discussed? Thanks. Mike.