I think everyone would agree that the leader election logic is messy and
that there is a lot of room for improvement.

The ultimate goal is to use Apache Curator to replace most of our complex
ZooKeeper logic. However, for an annoying reason, that work has been
stalled for the past year.

The leader election logic, while usually good, is often a source of pain
for people running or managing Solr.
I think fixing these issues piecemeal is probably the way to go until we
can resume our long-awaited Curator migration.
Would you mind opening an issue/PR to tackle what you found?
It'll probably be easier to discuss the specifics there.

- Houston

On Tue, Mar 28, 2023 at 10:47 AM Pierre Salagnac <pierre.salag...@gmail.com>
wrote:

> Hello everyone,
> I'm investigating issues where a replica ends up with no leader, and I
> wonder whether these specific cases were already discussed somewhere.
>
> More specifically in the code, I (with the help of my colleagues)
> identified two gaps where we exit the leadership process without ever
> going back to it. Both of them happen when the election ephemeral node is
> dropped because the ZooKeeper session expired.
>
> First one, in class LeaderElector:
> - we log *"Our node is no longer in line to be leader"*
> - and immediately return
>
> Second one, in class
> - we log *"Will not register as leader because it seems the election is no
> longer taking place."*
> - and immediately return
>
> In both cases, we explicitly check that our sequential node still exists
> in the election. The first case calls zkClient.getChildren(...) and then
> validates the result, while the second case catches a NoNodeException.
> If I'm not missing anything, the node never gets back into this election.
> Since we aborted, any other nodes can become the leader for this shard.
> But if none are there (and we are), we just can't be the leader.
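
To make sure we are talking about the same thing, here is a minimal sketch
of that gap. It is not Solr code and it does not touch ZooKeeper: Election
and Candidate are made-up in-memory stand-ins for the election znode and
for the replica, and the rejoinWhenMissing flag is only one hypothetical
way to handle the case, not something that exists today.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

class ElectionRejoinSketch {

    /** Hypothetical in-memory stand-in for the children of the election znode. */
    static class Election {
        private final TreeSet<Integer> nodes = new TreeSet<>();
        private int nextSeq = 0;

        /** Adds a new sequential node and returns its sequence number. */
        int join() {
            int seq = nextSeq++;
            nodes.add(seq);
            return seq;
        }

        /** Models the ephemeral node being dropped when the session expires. */
        void expire(int seq) {
            nodes.remove(seq);
        }

        /** Returns the current nodes, lowest sequence number first. */
        List<Integer> children() {
            return new ArrayList<>(nodes);
        }
    }

    /** Hypothetical replica taking part in the election. */
    static class Candidate {
        final Election election;
        final boolean rejoinWhenMissing; // assumption: a possible fix, not existing behaviour
        int seq;
        boolean leader;

        Candidate(Election election, boolean rejoinWhenMissing) {
            this.election = election;
            this.rejoinWhenMissing = rejoinWhenMissing;
            this.seq = election.join();
        }

        /** Checks whether we hold the lowest sequence number in the election. */
        void check() {
            leader = false;
            List<Integer> children = election.children();
            if (!children.contains(seq)) {
                if (!rejoinWhenMissing) {
                    // The gap described above: our node is gone, so we
                    // return and nothing ever brings us back.
                    return;
                }
                // Sketched alternative: create a fresh node and carry on.
                seq = election.join();
                children = election.children();
            }
            leader = children.get(0) == seq;
        }
    }

    public static void main(String[] args) {
        for (boolean rejoin : new boolean[] {false, true}) {
            Election election = new Election();
            Candidate only = new Candidate(election, rejoin);
            only.check();
            election.expire(only.seq); // session expiry drops our node
            only.check();
            System.out.println("rejoin=" + rejoin
                + " leader=" + only.leader
                + " nodesInElection=" + election.children().size());
        }
    }
}
```

With rejoinWhenMissing set to false, the only candidate stays out of the
election after the expiry and the shard has nobody left to lead it; with
true, it gets a new node and becomes leader again. A real fix would of
course also have to deal with the new session and with watches, which
this sketch ignores.
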
>
>
> Taking a step back, it seems to me that the error handling in the leader
> election code is messy. There are a large number of catch blocks. Some of them
> trigger a retry of the election while some of them don't.
>
> Have these issues already been discussed?
> Thanks
>
