Hello everyone,
I'm investigating issues where a replica ends up having no leader, and I
wonder whether the cases described below have already been discussed somewhere.

More specifically, in the code, I (with the help of my colleagues)
identified two gaps where we exit the leader election process without ever
going back to it. Both of them happen when the election ephemeral node is
dropped because the ZooKeeper session expired.

First one, in class LeaderElector:
- we log *"Our node is no longer in line to be leader"*
- and immediately return

Second one, in class
- we log *"Will not register as leader because it seems the election is no
longer taking place."*
- and immediately return

In both cases, we explicitly check that our sequential node still exists in the
election. The first case calls zkClient.getChildren(...) and then validates
the results, while the second case catches a NoNodeException.
Unless I'm missing something, the node never rejoins this election. Since
we aborted, any other nodes that are present can become the leader for this
shard. But if none are there (and we are), we simply can't become the leader.
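
To make the first case concrete, here is a minimal sketch of the gap and of one possible way to close it. It is not the actual LeaderElector code: ElectionStore, InMemoryElectionStore, ElectionParticipant and the n_ node naming are hypothetical stand-ins so that the example runs without a ZooKeeper ensemble. The only point is the branch where our node is missing from the children: rather than returning, the sketch rejoins the election.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the election directory; not a Solr or ZooKeeper API.
interface ElectionStore {
    List<String> getChildren();      // names of the current election nodes
    String createSequentialNode();   // adds a new sequential node, returns its name
}

// In-memory implementation, used only so the sketch is runnable.
class InMemoryElectionStore implements ElectionStore {
    private final List<String> nodes = new ArrayList<>();
    private int nextSeq = 0;

    public List<String> getChildren() {
        return new ArrayList<>(nodes);
    }

    public String createSequentialNode() {
        // Zero-padded so that lexicographic order matches sequence order.
        String name = String.format("n_%010d", nextSeq++);
        nodes.add(name);
        return name;
    }

    // Simulates a session expiry: every ephemeral node disappears.
    void expireSession() {
        nodes.clear();
    }
}

class ElectionParticipant {
    private final ElectionStore store;
    private String myNode;

    ElectionParticipant(ElectionStore store) {
        this.store = store;
        this.myNode = store.createSequentialNode();
    }

    String myNode() {
        return myNode;
    }

    // Returns true if our node has the lowest sequence number.
    boolean checkIfIAmLeader() {
        List<String> children = store.getChildren();
        if (!children.contains(myNode)) {
            // This is the situation where the message
            // "Our node is no longer in line to be leader" is logged.
            // Assumption of this sketch: rejoin instead of returning.
            myNode = store.createSequentialNode();
            children = store.getChildren();
        }
        Collections.sort(children);
        return children.get(0).equals(myNode);
    }
}
```

In this sketch, a participant that was alone in the election, lost its node to a simulated session expiry, and then called checkIfIAmLeader() again would recreate its node and become leader again, instead of leaving the shard leaderless.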


Taking a step back, it seems to me error handling in the leader election
code is messy. There are a large number of catch blocks. Some of them
trigger a retry of the election while some of them don't.

Are these issues that have already been discussed?
Thanks
