Hello everyone, I'm investigating issues where a replica ends up with no leader, and I wonder whether the specific cases I found were already discussed somewhere.
More specifically, in the code, I (with the help of my colleagues) identified two gaps where we exit the leader election process without ever going back to it. Both of them happen when the election ephemeral node is dropped because the ZooKeeper session expired.

First one, in class LeaderElector:
- we log *"Our node is no longer in line to be leader"*
- and immediately return

Second one, in class [...]:
- we log *"Will not register as leader because it seems the election is no longer taking place."*
- and immediately return

In both cases, we explicitly check that our sequential node still exists in the election. The first case calls zkClient.getChildren(...) and then validates the results, while the second case catches a NoNodeException.

Unless I'm missing something, the node won't rejoin this election. Since we aborted, any other nodes that are present can become the leader for this shard. But if none are there (and we are), we just can't become the leader.

Taking a step back, it seems to me that error handling in the leader election code is messy. There are a large number of catch blocks; some of them trigger a retry of the election while others don't.

Are these issues that were already discussed?

Thanks
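
PS: to make the first case concrete, here is a minimal in-memory sketch of the pattern as I understand it. It is not the actual LeaderElector code and does not use the ZooKeeper client. The class ElectionSketch, its methods, and the node naming scheme are all hypothetical names I made up for illustration. It only models "our sequential node is gone, so we return" against "our node is gone, so we rejoin".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical in-memory model of an election; not Solr or ZooKeeper code. */
public class ElectionSketch {
    // Stand-in for the children of the election znode.
    private final List<String> children = new ArrayList<>();
    private int nextSeq = 0;

    /** Joins the election by adding a sequential node; returns its name. */
    public String join(String replica) {
        // Assumed naming scheme: zero-padded sequence first, so names sort by sequence.
        String node = String.format("n_%010d_%s", nextSeq++, replica);
        children.add(node);
        return node;
    }

    /** Simulates a session expiry: the ephemeral node disappears. */
    public void expire(String node) {
        children.remove(node);
    }

    /** Pattern described above: if our node is missing, give up and never retry. */
    public String checkAndReturn(String myNode) {
        List<String> sorted = new ArrayList<>(children);
        Collections.sort(sorted);
        if (!sorted.contains(myNode)) {
            return null; // "no longer in line to be leader", nothing re-enters the election
        }
        return sorted.get(0); // lowest sequence number is the leader
    }

    /** One possible alternative: if our node is missing, rejoin before checking. */
    public String checkAndRejoin(String replica, String myNode) {
        if (!children.contains(myNode)) {
            join(replica);
        }
        List<String> sorted = new ArrayList<>(children);
        Collections.sort(sorted);
        return sorted.get(0);
    }

    public static void main(String[] args) {
        ElectionSketch election = new ElectionSketch();
        String node = election.join("replicaA");
        election.expire(node); // session expired, node dropped

        System.out.println("abort variant leader:  " + election.checkAndReturn(node));
        System.out.println("rejoin variant leader: " + election.checkAndRejoin("replicaA", node));
    }
}
```

With a single replica, the abort variant ends with no leader at all, while the rejoin variant elects the replica again under a new sequence number. Whether rejoining is actually safe in the real code (for example with respect to a new session being established first) is something I have not verified.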