What version of Solr is having this problem?

On Tue, Mar 28, 2023 at 10:47 AM Pierre Salagnac <pierre.salag...@gmail.com> wrote:
> Hello everyone,
>
> I'm investigating issues where a replica ends up having no leader, and I
> wonder whether the cases I describe below were already discussed somewhere.
>
> More specifically, in the code, I (with the help of my colleagues)
> identified two gaps where we exit the leadership process without ever
> going back to it. Both of them happen when the election ephemeral node is
> dropped because the ZooKeeper session expired.
>
> First one, in class LeaderElector:
> - we log *"Our node is no longer in line to be leader"*
> - and immediately return
>
> Second one, in class
> - we log *"Will not register as leader because it seems the election is no
>   longer taking place."*
> - and immediately return
>
> In both cases, we explicitly check that our sequential node still exists in
> the election. The first case calls zkClient.getChildren(...) and then
> validates the results, while the second case catches a NoNodeException.
> Unless I'm missing something, the node won't get back into this election.
> Since we aborted, this allows any other nodes to become the leader for this
> shard. But if they're not there (and we are), we just can't be the leader.
>
> Taking a step back, it seems to me that error handling in the leader
> election code is messy. There are a large number of catch blocks. Some of
> them trigger a retry of the election while others don't.
>
> Are these issues that were already discussed?
>
> Thanks
>

--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)