On Thu, Oct 13, 2022 at 10:28 AM Bruno Roustant <bruno.roust...@gmail.com>
wrote:

> I don't yet know enough about how the current leader election mechanism
> works.  I'd like to see a comparison between this proposal and the current
> mechanism.
>
> B. With this proposal, each time we need the leader, we check it. What is
> the cost of this check? Do we need to read the cluster state each time?
>

DocCollection and friends are updated mainly via ZK watchers (i.e. async;
not in the critical path).  I proposed below that client/server could pass
ZK version hints to let either side know that it ought to explicitly go to
ZK when the states differ.  Regardless, "each time" (an update comes), we
need not get the state from ZK.  The state could be stale (as it can be
today!).  A replica will interact with other replicas (either as a leader or
non-leader) and discover it's acting on stale information, *then* update
itself.
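
To make that concrete, here's a rough sketch in Java of the pattern I mean
(the version-hint parameter is hypothetical; ZkStateReader#forceUpdateCollection
is the existing explicit-read mechanism, if memory serves):

  // Sketch only: refresh our cached state when a peer's version hint says
  // we're behind.  senderStateVersion is a hypothetical hint on the request.
  void maybeRefreshState(ZkStateReader reader, String collection,
                         int senderStateVersion) throws Exception {
    DocCollection coll = reader.getClusterState().getCollectionOrNull(collection);
    if (coll == null || senderStateVersion > coll.getZNodeVersion()) {
      // Explicit ZK read; watchers stay out of the critical path.
      reader.forceUpdateCollection(collection);
    }
  }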


> C.1.A "Strict use of ZkShardTerms". Does that mean that we wait for one
> replica to become up to date?


Yes.


> Does that mechanism already exist?
>

Yes, certainly.  Replicas sync with the leader in order to become
state=ACTIVE.  Thanks to ZkShardTerms, they know if they are caught up, and
are thus already leader-eligible.  I see extra complexity around leadership
that perhaps pre-dated ZkShardTerms.
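
Roughly, from memory (check the exact accessors):

  // A replica is leader-eligible when it's live, ACTIVE, and ZkShardTerms
  // says it holds the highest term for the shard, i.e. it's fully caught up.
  ZkShardTerms terms = zkController.getShardTerms(collection, shardId);
  boolean eligible = replica.getState() == Replica.State.ACTIVE
      && clusterState.liveNodesContain(replica.getNodeName())
      && terms.canBecomeLeader(replica.getName());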


> In the meantime, if the previous leader comes back, we can choose it.
>

Yes; I think this is rather typical; maybe the leader's node simply
restarted.

~ David

On Thu, Oct 13, 2022 at 4:24 PM David Smiley <dsmi...@apache.org> wrote:
>
> > "JIT" is Just-In-Time; a way of looking at it.  Or call it on-demand
> leader
> > elections.
> >
> > A property of my proposal that may not be obvious is that REBALANCELEADERS
> > would be needless since preferredLeaders would become leaders automatically
> > on-demand (when a leader is next needed).
> >
> > It's not clear how amenable Curator's election recipe is to preferredLeader
> > and ZkShardTerms requirements; so using Curator for elections might simply
> > not be an option?
> >
> > Here's another thematic concept that is tangentially related; maybe it's
> > worth its own discussion:  Solr's view of state from ZK could become stale
> > at any time.  Embrace sharing of ZK node versions between client and server
> > so that either side knows when it needs to update its state from ZK.
> > Watchers are nice but we shouldn't rely on them for correctness.  Put
> > differently, SolrCloud should function (albeit perhaps slowly) if its use
> > of ZK Watchers stopped working.  Today, CloudSolrClient passes _stateVer_
> > and it's parsed on the server by
> > org.apache.solr.servlet.HttpSolrCall#checkStateVersionsAreValid.  This
> > should be used more pervasively in most interactions, not merely a subset
> > of uses of CloudSolrClient.  I think it might only be used to tell the
> > client to get more up to date, but ideally it'd be bidirectional.
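> >
> > Hypothetically, the bidirectional form could look like this on the server
> > (parseStateVer, req, and rsp are invented names; only _stateVer_ and
> > checkStateVersionsAreValid exist today):
> >
> >   int clientVer = parseStateVer(req);                    // from _stateVer_
> >   int serverVer = docCollection.getZNodeVersion();
> >   if (clientVer > serverVer) {
> >     zkStateReader.forceUpdateCollection(collection);     // server is stale: re-read ZK
> >   } else if (clientVer < serverVer) {
> >     rsp.add("_stateVer_", collection + ":" + serverVer); // tell the client to refresh
> >   }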
> >
> > ~ David Smiley
> > Apache Lucene/Solr Search Developer
> > http://www.linkedin.com/in/davidwsmiley
> >
> >
> > On Tue, Oct 11, 2022 at 6:34 PM David Smiley <dsmi...@apache.org> wrote:
> >
> > > At work, I’ve attempted to troubleshoot issues relating to shard
> > > leadership.  It’s quite possible that the root causes may be related to
> > > customizations in my fork of Solr; who knows.  The leadership
> > > code/algorithm is so hard to debug/troubleshoot that it’s hard to say.
> > > It’s no secret that Solr’s code here is a complicated puzzle[1].  Out of
> > > this frustration, I began to ponder a fantasy of how I want leader
> > > election to work, informed by my desire to scale to massive numbers of
> > > collections & shards on a cluster.  Using Curator for elections would
> > > perhaps address stability but not this scale.  I’d like to get input from
> > > you all on this fantasy.  Surely I have overlooked things; please offer
> > > your insights!
> > >
> > > Thematic concept:  Don’t change/elect leaders until it’s actually
> > > necessary.  In most cases where I work, the leader will return before we
> > > truly need a leader.  Even when not true, I don’t think doing it lazily
> > > should be a noticeable issue?  If so, it’s easy to imagine augmenting
> > > this design with an optional eager leadership election.
> > >
> > > A. Only code paths that truly need a leader will do “leadership checks”,
> > > resulting in a potential leader election.  This is principally on
> > > indexing in DistributedZkUpdateProcessor but there are likely more spots.
> > >
> > > B. Leader check: Check if the shard’s leader is (a) known, and (b)
> > > state=ACTIVE, and (c) on a “live” node, and (d) the preferredLeader
> > > condition is satisfied.  Otherwise, try to elect a leader in a loop
> > > until this set of conditions is achieved or a timeout is reached.
> > > B.A: The preferredLeader condition means that either the leader is
> > > marked as preferredLeader, or no replica with preferredLeader is
> > > eligible for leadership.  (Sketched below.)
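> > >
> > > In pseudocode-ish Java, the check might look like this (all names below
> > > are illustrative, not an actual patch; "property.preferredleader" is the
> > > property REBALANCELEADERS uses, if I recall correctly):
> > >
> > >   boolean leaderIsValid(Slice slice, Set<String> liveNodes) {
> > >     Replica leader = slice.getLeader();
> > >     if (leader == null) return false;                             // (a) known
> > >     if (leader.getState() != Replica.State.ACTIVE) return false;  // (b) ACTIVE
> > >     if (!liveNodes.contains(leader.getNodeName())) return false;  // (c) live node
> > >     // (d) preferredLeader: fine if the leader is preferred, or if no
> > >     // eligible replica is marked preferred (isEligible: hypothetical helper).
> > >     if (leader.getBool("property.preferredleader", false)) return true;
> > >     return slice.getReplicas().stream().noneMatch(
> > >         r -> r.getBool("property.preferredleader", false) && isEligible(r));
> > >   }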
> > >
> > > C. “Try to elect a leader”:   (The word “election” might not be the best
> > > word for this algorithm, but whatever).
> > > C.1.: A replica must be eligible to be a leader.  It must be live (on a
> > > live node) and have an ACTIVE state.  And, very important, eligibility
> > > should be governed by ZkShardTerms, which knows which replicas have the
> > > most up-to-date state.
> > > C.1.A: Strict use of ZkShardTerms is designed to ensure that there is no
> > > data loss.  That said, “forceLeader” remains in the toolbox of Solr
> > > admins (which monkeys with ZkShardTerms to cheat).  We may need a new
> > > optional mechanism to be closer to what we have today — to basically
> > > ignore ZkShardTerms after a configured period of time?
> > > C.1.B. I am assuming that replicas will become eligible on their own
> > > (e.g. as nodes re-join) instead of this algorithm needing to
> > > initiate/tell any to get into this state somehow.
> > > C.2: If there are no leader-eligible replicas, complain with useful
> > > information to diagnose why no leader was found.  Don’t log this if we
> > > already logged this same message in our leadership check loop.  Sleep
> > > perhaps 1000ms and try the loop again.  If we can wait/monitor on the
> > > state of something convenient then do that to avoid sleeping for too long.
> > > C.3: Of the leader-eligible replicas — pick whichever one as the leader
> > > (e.g. random).  Prefer preferredLeader=true replicas, of course.  ZK will
> > > solve races if this algorithm runs on more than one node.
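> > >
> > > Looped together, roughly (eligibleReplicas, logNoLeaderOnce, and
> > > publishLeaderToZk are illustrative helpers; the last would do a
> > > versioned/conditional ZK write so concurrent electors can't both win):
> > >
> > >   Replica electLeader(Slice slice, long timeoutMs) throws InterruptedException {
> > >     long deadlineNs = System.nanoTime() + timeoutMs * 1_000_000;
> > >     while (System.nanoTime() < deadlineNs) {
> > >       List<Replica> eligible = eligibleReplicas(slice);  // C.1: live + ACTIVE + ZkShardTerms
> > >       if (eligible.isEmpty()) {
> > >         logNoLeaderOnce(slice);                          // C.2: diagnose, but don't spam
> > >         Thread.sleep(1000);
> > >         continue;
> > >       }
> > >       Replica pick = eligible.stream()                   // C.3: prefer preferredLeader
> > >           .filter(r -> r.getBool("property.preferredleader", false))
> > >           .findFirst().orElse(eligible.get(0));
> > >       if (publishLeaderToZk(slice, pick)) return pick;   // versioned write resolves races
> > >     }
> > >     throw new IllegalStateException("No leader elected for " + slice.getName());
> > >   }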
> > >
> > > D. Only track leadership in Slice (i.e. within the cluster state) which
> > > is backed by one spot in ZK.  Don’t put it in places like CloudDescriptor
> > > or other places in ZK.
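> > >
> > > i.e. reading it would always be:
> > >
> > >   // Slice (within the cluster state) as the single place leadership lives.
> > >   Replica leader = clusterState.getCollection(collection)
> > >       .getSlice(shardId)
> > >       .getLeader();  // may be null or stale; the check in B handles that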
> > >
> > > Thoughts?
> > >
> > >
> > > [1]
> > > https://lists.apache.org/list?dev@solr.apache.org:2021-10:MILLER%20leader
> > > “ZkCmdExecutor” thread with Mark Miller, and referencing
> > > https://www.solrdev.io/leader-election-adventure.html which no longer
> > > resolves
> > >
> > >
> > > ~ David Smiley
> > > Apache Lucene/Solr Search Developer
> > > http://www.linkedin.com/in/davidwsmiley
> > >
> >
>
