We recently had a couple of issues with production clusters because of race
conditions in shard leader election. By race condition here, I mean within a
single node: I'm not discussing how leader election is distributed across
multiple Solr nodes, but how multiple threads in a single Solr node conflict
with each other.

Overall, when two threads (on the same server) concurrently join leader
election for the same replica, the outcome is unpredictable: it may end with
two nodes thinking they are the leader, or with no leader at all.
I have identified two scenarios, but there may be more:

1. The ZooKeeper session expires while an election is already in progress.
When we re-create the ZooKeeper session, we re-register all the cores and
join the election for each of them. If an election is already in progress or
is triggered for any reason, we can end up with two threads on the same Solr
node running leader election for the same core.

2. The REJOINLEADERELECTION command is received twice concurrently for the
same core.
This scenario is much easier to reproduce with an external client, as
sketched below. It affects us because we have customizations that use this
command.
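To illustrate what I mean by reproducing it with an external client: fire two
REJOINLEADERELECTION CoreAdmin requests for the same core at (almost) the same
instant. This is only a sketch; the host, core name and parameter list are
placeholders, and the exact parameters that REJOINLEADERELECTION expects depend
on the Solr version, so check CoreAdminParams before using it.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CountDownLatch;

public class RejoinRace {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Assumed URL and parameters; adjust host, core and any extra
        // params (collection, shard, ...) required by your Solr version.
        String url = "http://localhost:8983/solr/admin/cores"
                + "?action=REJOINLEADERELECTION&core=collection1_shard1_replica_n1";
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();

        CountDownLatch start = new CountDownLatch(1);
        Runnable call = () -> {
            try {
                start.await(); // release both requests at the same time
                HttpResponse<String> rsp =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println(Thread.currentThread().getName() + " -> " + rsp.statusCode());
            } catch (Exception e) {
                e.printStackTrace();
            }
        };
        Thread t1 = new Thread(call, "rejoin-1");
        Thread t2 = new Thread(call, "rejoin-2");
        t1.start();
        t2.start();
        start.countDown();
        t1.join();
        t2.join();
    }
}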


The code for leader election hasn't changed much in a while, and I don't
fully understand the history behind it. I wonder whether multithreading was
already discussed and/or taken into account. The code contains a "TODO: can
we even get into this state?" that makes me think this issue was already
reproduced but not fully solved/understood.
Since this code makes many calls to ZooKeeper, I don't think we can just
"synchronize" it with mutual exclusion: calls that go over the network can be
incredibly slow when something goes wrong, and we don't want any thread to be
blocked by another one waiting for a remote call to complete.
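To make that concrete, one direction I have in mind is a per-core guard that
never blocks: a second thread trying to join the election for the same core
either bails out or re-schedules its attempt, instead of waiting behind a slow
ZooKeeper call. This is only a rough sketch, not actual Solr code; the class
and method names below are invented.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class ElectionGuard {
    private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    /** Runs the election body only if no other thread is in it for this core. */
    public boolean tryJoinElection(String coreName, Runnable electionBody) {
        ReentrantLock lock = locks.computeIfAbsent(coreName, c -> new ReentrantLock());
        if (!lock.tryLock()) {
            // Another thread is already joining the election for this core;
            // fail fast (or schedule a single retry) instead of blocking on it.
            return false;
        }
        try {
            electionBody.run(); // the slow part: znode creation, watches, etc.
            return true;
        } finally {
            lock.unlock();
        }
    }
}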

I would like to get some opinions on making this code more robust to
concurrency. Unless the consensus is "no, this code should really be
single-threaded!", I can give it a try.

Thanks
