I think it's a worthy problem to address, given that we (Pierre and I work at
the same company) ran into a production incident because of it.
Who's familiar enough with the leader election code, and interested enough, to
help review such changes?

Thanks,
Ilan

On Mon, Dec 18, 2023 at 5:33 PM Pierre Salagnac <pierre.salag...@gmail.com>
wrote:

> We recently had a couple of issues with production clusters because of race
> conditions in shard leader election. By race condition here, I mean within a
> single node: I'm not discussing how leader election is distributed
> across multiple Solr nodes, but how multiple threads in a single Solr node
> conflict with each other.
>
> Overall, when two threads (on the same server) concurrently join leader
> election for the same replica, the outcome is unpredictable. It may end with
> two nodes thinking they are the leader, or with no leader at all.
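>
> To make the hazard concrete, here is a minimal, self-contained sketch (plain
> ZooKeeper client, not Solr code; the election path and node names are made
> up) of two threads registering the same replica in an election queue. Both
> create an ephemeral sequential node, so the same replica ends up queued
> twice, and whatever watch/cleanup logic runs next sees a state it does not
> expect:
>
> import java.util.List;
> import org.apache.zookeeper.*;
>
> public class DuplicateElectionJoin {
>     public static void main(String[] args) throws Exception {
>         // For brevity, we don't wait for the session to be fully established.
>         ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> {});
>         String electionPath = "/collections/col1/leader_elect/shard1/election";
>
>         // Create the persistent parent chain for the demo (ignore "already exists").
>         StringBuilder p = new StringBuilder();
>         for (String part : electionPath.substring(1).split("/")) {
>             p.append('/').append(part);
>             try {
>                 zk.create(p.toString(), new byte[0],
>                         ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
>             } catch (KeeperException.NodeExistsException ignore) {}
>         }
>
>         Runnable join = () -> {
>             try {
>                 // Each call adds another ephemeral sequential entry for the SAME replica.
>                 zk.create(electionPath + "/replica_core_node1-n_", new byte[0],
>                         ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
>             } catch (Exception e) {
>                 throw new RuntimeException(e);
>             }
>         };
>         Thread t1 = new Thread(join);  // e.g. the regular registration path
>         Thread t2 = new Thread(join);  // e.g. a re-registration after session expiry
>         t1.start(); t2.start();
>         t1.join(); t2.join();
>
>         // The same replica is now queued twice in the election.
>         List<String> children = zk.getChildren(electionPath, false);
>         System.out.println("Election queue: " + children);
>         zk.close();
>     }
> }
>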
> I identified two scenarios, but maybe there are more:
>
> 1. The ZooKeeper session expires while an election is already in progress.
> When we re-create the ZooKeeper session, we re-register all the cores and
> join elections for all of them. If an election is already in progress or is
> triggered for any reason, we can end up with two threads on the same Solr
> node running leader election for the same core.
>
> 2. Command REJOINLEADERELECTION is received twice concurrently for the same
> core.
> This scenario is much easier to reproduce with an external client. It
> occurs for us since we have customizations using this command.
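>
> For what it's worth, reproducing this only takes two concurrent requests to
> the CoreAdmin endpoint of the same node. A rough client-side sketch (the URL,
> core name and parameters are illustrative; the real command typically needs
> more parameters such as collection/shard/coreNodeName, depending on the
> setup):
>
> import java.net.URI;
> import java.net.http.HttpClient;
> import java.net.http.HttpRequest;
> import java.net.http.HttpResponse;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> public class ConcurrentRejoin {
>     public static void main(String[] args) {
>         // Illustrative URL and parameters; adjust to the actual core/collection.
>         String url = "http://localhost:8983/solr/admin/cores"
>                 + "?action=REJOINLEADERELECTION&core=mycore_shard1_replica_n1";
>         HttpClient client = HttpClient.newHttpClient();
>         HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
>
>         // Send the same command twice at (almost) the same time to one Solr node.
>         ExecutorService pool = Executors.newFixedThreadPool(2);
>         for (int i = 0; i < 2; i++) {
>             pool.submit(() -> {
>                 HttpResponse<String> rsp =
>                         client.send(request, HttpResponse.BodyHandlers.ofString());
>                 System.out.println("HTTP " + rsp.statusCode());
>                 return null;
>             });
>         }
>         pool.shutdown();
>     }
> }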
>
>
> The code for leader election hasn't changed much in a while, and I don't
> understand the full history behind it. I wonder whether multithreading was
> already discussed and/or taken into account. The code has a "TODO: can we
> even get into this state?" that makes me think this issue was already
> reproduced but not fully solved/understood.
> Since this code makes many calls to ZooKeeper, I don't think we can just
> "synchronize" it with mutual exclusion: these network calls can be incredibly
> slow when something goes wrong, and we don't want one thread to be blocked by
> another that is waiting for a remote call to complete.
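>
> One direction could be to serialize election attempts per core on a small
> queue instead of a lock: callers submit an attempt and return immediately,
> attempts for the same core run one at a time, and an attempt that has been
> superseded (e.g. by a re-registration after session expiry) bails out instead
> of redoing slow ZooKeeper calls. A rough sketch (none of these class/method
> names exist in Solr, they are only here to illustrate the idea):
>
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
> import java.util.concurrent.atomic.AtomicLong;
>
> public class PerCoreElectionQueue {
>     private final ConcurrentHashMap<String, ExecutorService> queues = new ConcurrentHashMap<>();
>     private final ConcurrentHashMap<String, AtomicLong> latestAttempt = new ConcurrentHashMap<>();
>
>     /** Callers return immediately; attempts for one core run one at a time. */
>     public void submitElectionAttempt(String coreName, Runnable joinElection) {
>         long ticket = latestAttempt
>                 .computeIfAbsent(coreName, c -> new AtomicLong())
>                 .incrementAndGet();
>         ExecutorService queue =
>                 queues.computeIfAbsent(coreName, c -> Executors.newSingleThreadExecutor());
>         queue.submit(() -> {
>             // A newer attempt supersedes this one: skip it rather than repeat
>             // ZooKeeper calls whose outcome would be stale anyway.
>             if (latestAttempt.get(coreName).get() != ticket) {
>                 return;
>             }
>             joinElection.run();  // the existing election logic, unchanged
>         });
>     }
> }
>
> The same "am I still the latest attempt?" check could be repeated between the
> individual ZooKeeper calls, so a long-running attempt gives up as soon as it
> is superseded.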
>
> I would like to get some opinions about making this code more robust to
> concurrency. Unless the prevailing opinion is "no, this code should actually
> be single-threaded!", I can give it a try.
>
> Thanks
>
