We recently had a couple of issues with production clusters because of race conditions in shard leader election. By race condition here, I mean within a single node: I'm not discussing how leader election is distributed across multiple Solr nodes, but how multiple threads in a single Solr node conflict with each other.
Overall, when two threads (on the same server) concurrently join leader election for the same replica, the outcome is unpredictable: it may end with two nodes thinking they are the leader, or with no leader at all. I identified two scenarios, but maybe there are more:

1. The Zookeeper session expires while an election is already in progress. When we re-create the Zookeeper session, we re-register all the cores and join elections for all of them. If an election is already in progress or is triggered for any reason, we can have two threads on the same Solr server node running leader election for the same core.

2. The REJOINLEADERELECTION command is received twice concurrently for the same core. This scenario is much easier to reproduce with an external client. It occurs for us because we have customizations that use this command.

The code for leader election hasn't changed much for a while, and I don't know the full history behind it. I wonder whether multithreading was already discussed and/or taken into account. The code contains a "TODO: can we even get into this state?" that makes me think this issue was already reproduced but not fully solved/understood.

Since this code makes many calls to Zookeeper, I don't think we can just "synchronize" it with mutual exclusions: these network calls can be incredibly slow when something goes wrong, and we don't want one thread to be blocked by another that is waiting for a remote call to complete.

I would like to get some opinions about making this code more robust to concurrency. Unless the main opinion is "no, this code should actually be single-threaded!", I can give it a try.
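To make the kind of change I have in mind more concrete, here is a minimal sketch of a per-core, non-blocking guard (the class and method names are hypothetical, this is not existing Solr code): a thread that finds another election already running for the same core backs off immediately instead of waiting on its Zookeeper calls.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Hypothetical sketch: allow at most one thread to run leader election
 * for a given core, without blocking other threads on slow Zookeeper calls.
 */
public class PerCoreElectionGuard {

    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /**
     * Runs the election body only if no other thread is currently electing
     * this core. Returns true if the body ran, false if it was skipped.
     */
    public boolean tryRunElection(String coreName, Runnable electionBody) {
        ReentrantLock lock = locks.computeIfAbsent(coreName, c -> new ReentrantLock());
        if (!lock.tryLock()) {
            // Another thread is already handling the election for this core;
            // do not block waiting for its (potentially slow) Zookeeper calls.
            return false;
        }
        try {
            electionBody.run();
            return true;
        } finally {
            lock.unlock();
        }
    }
}

The open question with such an approach is what to do on the "false" path: the skipped attempt would probably need to be re-queued (or the state re-checked once the running election finishes) so that a legitimate rejoin request is not silently dropped.

Thanks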