I think it's a worthy problem to address given we (we work at the same company) ran into a production incident due to it. Who's familiar and interested enough in leader election code to help review such changes?
Thanks,
Ilan

On Mon, Dec 18, 2023 at 5:33 PM Pierre Salagnac <pierre.salag...@gmail.com> wrote:

> We recently had a couple of issues with production clusters because of race
> conditions in shard leader election. By race condition here, I mean a race
> within a single node. I'm not discussing how leader election is distributed
> across multiple Solr nodes, but how multiple threads in a single Solr node
> conflict with each other.
>
> Overall, when two threads (on the same server) concurrently join
> leader election for the same replica, the outcome is unpredictable. It may
> end in two nodes thinking they are the leader or not having any leader at
> all.
> I identified two scenarios, but maybe there are more:
>
> 1. ZooKeeper session expires while an election is already in progress.
> When we re-create the ZooKeeper session, we re-register all the cores, and
> join elections for all of them. If an election is already in progress or is
> triggered for any reason, we can have two threads on the same Solr server
> node running leader election for the same core.
>
> 2. Command REJOINLEADERELECTION is received twice concurrently for the same
> core.
> This scenario is much easier to reproduce with an external client. It
> occurs for us since we have customizations using this command.
>
>
> The code for leader election hasn't changed much for a while, and I don't
> understand the full history behind it. I wonder whether multithreading was
> already discussed and/or taken into account. The code has a "TODO: can we
> even get into this state?" that makes me think this issue was already
> reproduced but not fully solved/understood.
> Since this code has many calls to ZooKeeper, I don't think we can just
> "synchronize" it with mutual exclusions, as these calls that involve the
> network can be incredibly slow when something bad happens. We don't want
> any thread to be blocked by another waiting for a remote call to complete.
>
> I would like to get some opinions about making this code more robust to
> concurrency. Unless the main opinion is "no, this code should actually be
> single-threaded!", I can give it a try.
>
> Thanks
>
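
To make the "no blocking on remote calls" constraint concrete, here is a minimal sketch of one possible approach. It is not Solr's actual code: ElectionGate, joinElection and the Runnable standing in for the election logic are all hypothetical names. The idea is that at most one thread runs the election for a given core. A second thread that arrives while an election is in flight does not wait on a lock. It only records that a re-run is needed and returns immediately, and the running thread performs the re-run once its current attempt finishes.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical per-core gate; an illustration only, not part of Solr. */
final class ElectionGate {
    private static final int IDLE = 0;          // no election running for this core
    private static final int RUNNING = 1;       // one thread is running the election
    private static final int RERUN_PENDING = 2; // running, and another request arrived meanwhile

    // One state cell per core name. Entries are never removed in this sketch.
    private final ConcurrentHashMap<String, AtomicInteger> states = new ConcurrentHashMap<>();

    /**
     * Runs the election unless another thread is already running it for the same core.
     * Returns true if the caller ran the election, false if it handed the request
     * over to the thread that is already running. Never blocks on another thread.
     */
    boolean joinElection(String coreName, Runnable election) {
        AtomicInteger state = states.computeIfAbsent(coreName, k -> new AtomicInteger(IDLE));

        // Either become the single runner, or flag a re-run and leave.
        while (true) {
            int s = state.get();
            if (s == IDLE) {
                if (state.compareAndSet(IDLE, RUNNING)) {
                    break;
                }
            } else if (state.compareAndSet(s, RERUN_PENDING)) {
                return false;
            }
        }

        // We are the only runner for this core.
        while (true) {
            try {
                election.run(); // slow remote calls happen here, outside any lock
            } catch (RuntimeException e) {
                state.set(IDLE); // simplification: a pending re-run request is dropped
                throw e;
            }
            if (state.compareAndSet(RUNNING, IDLE)) {
                return true; // nobody asked for a re-run while we were running
            }
            state.set(RUNNING); // consume the pending request and run once more
        }
    }
}
```

Any number of requests arriving during one attempt collapse into a single re-run, which seems acceptable if the election always re-reads the current state when it starts. Whether a re-run is the right reaction to a concurrent request (rather than, say, cancelling the in-flight attempt) is an assumption that would need checking against the real election code.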