The message by Pierre is about fixing the existing code. Leader-on-demand doesn't seem to be a short-term solution in any case, and there wasn't really consensus around that proposal.
Ilan

On Tue, Dec 19, 2023 at 4:16 PM David Smiley <dsmi...@apache.org> wrote:

> I would be more in favor of going back to the drawing board on leader
> election than incremental improvements. Go back to first principles. The
> clarity just isn't there to be maintained. I don't trust it.
>
> Coincidentally, I sent a message to the Apache Curator users list yesterday
> to inquire about leader prioritization:
> https://lists.apache.org/thread/lmm30qpm17cjf4b93jxv0rt3bq99c0sb
> I suspect the "users" list is too low-activity to be useful for the Curator
> project; I'm going to try elsewhere.
>
> For shards, there doesn't even need to be a "leader election" recipe,
> because there are no shard leader threads that always need to be
> thinking/doing stuff, unlike the Overseer. It could be more demand-driven
> (assign a leader on demand if it needs to be re-assigned), and thus also be
> more scalable for many shards.
> Some of my ideas on this:
> https://lists.apache.org/thread/kowcp2ftc132pq0y38g9736m0slchjg7
>
> On Mon, Dec 18, 2023 at 11:33 AM Pierre Salagnac <pierre.salag...@gmail.com>
> wrote:
>
> > We recently had a couple of issues with production clusters because of
> > race conditions in shard leader election. By race condition here, I mean
> > within a single node. I'm not discussing how leader election is
> > distributed across multiple Solr nodes, but how multiple threads in a
> > single Solr node conflict with each other.
> >
> > Overall, when two threads (on the same server) concurrently join leader
> > election for the same replica, the outcome is unpredictable. It may end
> > with two nodes thinking they are the leader, or with no leader at all.
> > I identified two scenarios, but maybe there are more:
> >
> > 1. The ZooKeeper session expires while an election is already in progress.
> > When we re-create the ZooKeeper session, we re-register all the cores and
> > join elections for all of them. If an election is already in progress or
> > is triggered for any reason, we can have two threads on the same Solr
> > server node running leader election for the same core.
> >
> > 2. The REJOINLEADERELECTION command is received twice concurrently for
> > the same core.
> > This scenario is much easier to reproduce with an external client. It
> > occurs for us because we have customizations that use this command.
> >
> > The code for leader election hasn't changed much in a while, and I don't
> > understand the full history behind it. I wonder whether multithreading
> > was already discussed and/or taken into account. The code has a "TODO:
> > can we even get into this state?" that makes me think this issue was
> > already reproduced but not fully solved/understood.
> > Since this code makes many calls to ZooKeeper, I don't think we can just
> > "synchronize" it with mutual exclusion, as these calls go over the
> > network and can be incredibly slow when something bad happens. We don't
> > want any thread to be blocked by another waiting for a remote call to
> > complete.
> >
> > I would like to get some opinions about making this code more robust to
> > concurrency. Unless the main opinion is "no, this code should actually
> > be single-threaded!", I can give it a try.
> >
> > Thanks
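
To make the concern about blocking concrete, here is a minimal sketch of one way to serialize election attempts per core without blocking callers behind slow ZooKeeper calls: a per-core tryLock, where a second thread that finds an election already in progress backs off instead of waiting. The names here (ElectionGuard, tryJoinElection) are illustrative only and are not Solr's actual election API.

// Hypothetical sketch: serialize leader-election attempts per core without
// blocking callers. Class and method names are illustrative, not Solr code.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

public class ElectionGuard {
  // One lock per core, created lazily and shared across threads on this node.
  private final Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();

  /**
   * Attempt to run the election body for the given core. If another thread is
   * already running an election for the same core, return immediately instead
   * of blocking behind its (potentially slow) ZooKeeper calls.
   */
  public boolean tryJoinElection(String coreName, Runnable electionBody) {
    ReentrantLock lock = locks.computeIfAbsent(coreName, c -> new ReentrantLock());
    if (!lock.tryLock()) {
      return false; // an election for this core is already in progress on this node
    }
    try {
      electionBody.run(); // the existing election logic, including ZooKeeper calls
      return true;
    } finally {
      lock.unlock();
    }
  }
}

Whether backing off is the right behavior for the caller, versus queueing exactly one retry once the in-flight election finishes, is part of the design question being raised here.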