[
https://issues.apache.org/jira/browse/SOLR-14462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144259#comment-17144259
]
ASF subversion and git services commented on SOLR-14462:
--------------------------------------------------------
Commit 25428013fb0ed8f8fdbebdef3f1d65dea77129c2 in lucene-solr's branch
refs/heads/master from Ilan Ginzburg
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2542801 ]
SOLR-14462: cache more than one autoscaling session (#1504)
SOLR-14462: cache more than one autoscaling session
> Autoscaling placement wrong with concurrent collection creations
> ----------------------------------------------------------------
>
> Key: SOLR-14462
> URL: https://issues.apache.org/jira/browse/SOLR-14462
> Project: Solr
> Issue Type: Bug
> Components: AutoScaling
> Affects Versions: master (9.0)
> Reporter: Ilan Ginzburg
> Assignee: Noble Paul
> Priority: Major
> Attachments: PolicyHelperNewLogs.txt, policylogs.txt
>
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> Under concurrent collection creation, wrong Autoscaling placement decisions
> can lead to severely unbalanced clusters.
> Sequential creation of the same collections is handled correctly and the
> cluster is balanced.
> *TL;DR:* under high load, the way the sessions that cache future changes to
> Zookeeper are managed causes the placement decisions of multiple concurrent
> Collection API calls to ignore each other and to be based on an identical
> “initial” cluster state, possibly leading to identical placement decisions
> and, as a consequence, cluster imbalance.
> *Some context first* for those less familiar with how Autoscaling deals with
> cluster state change: a PolicyHelper.Session is created with a snapshot of
> the Zookeeper cluster state and is used to track cluster state changes that
> have already been decided but not yet persisted to Zookeeper, so that
> Collection API commands can make the right placement decisions.
> A Collection API command either uses an existing cached Session (that
> includes changes computed by previous command(s)) or creates a new Session
> initialized from the Zookeeper cluster state (i.e. with only state changes
> already persisted).
> When a Collection API command requires a Session (one is needed for any
> cluster state update computation) and the cached Session exists but is
> currently in use, the command can wait up to 10 seconds. If the Session
> becomes available within that time, it is reused; otherwise, a new one is
> created.
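> To make this acquisition step concrete, here is a rough, self-contained
> sketch of the behavior described above (class and method names are made up
> for illustration and are not the actual PolicyHelper code):
> {code:java}
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.locks.Condition;
> import java.util.concurrent.locks.ReentrantLock;
>
> final class SingleSessionCacheSketch {
>   private final ReentrantLock lock = new ReentrantLock();
>   private final Condition released = lock.newCondition();
>   private Object cachedSession;   // stand-in for a PolicyHelper.Session
>   private boolean inUse;
>
>   /** Wait up to 10s for the cached Session, else build a fresh one from ZK state. */
>   Object acquire() throws InterruptedException {
>     lock.lock();
>     try {
>       long remaining = TimeUnit.SECONDS.toNanos(10);
>       while (cachedSession != null && inUse && remaining > 0) {
>         remaining = released.awaitNanos(remaining);
>       }
>       if (cachedSession == null || !inUse) {
>         // Nothing cached yet, or the cached Session is free: (re)use it.
>         // Reuse means later commands see earlier, not-yet-persisted decisions.
>         if (cachedSession == null) {
>           cachedSession = createFromZkState();
>         }
>         inUse = true;
>         return cachedSession;
>       }
>       // Timed out while the cached Session stayed busy: build a private Session
>       // from the persisted cluster state, blind to the pending placement
>       // decisions of the other waiting commands.
>       return createFromZkState();
>     } finally {
>       lock.unlock();
>     }
>   }
>
>   /** Only the cached Session is tracked here; private Sessions are just dropped. */
>   void release(Object session) {
>     lock.lock();
>     try {
>       if (session == cachedSession) {
>         inUse = false;
>         released.signalAll();
>       }
>     } finally {
>       lock.unlock();
>     }
>   }
>
>   private Object createFromZkState() {
>     return new Object();          // placeholder for a Session seeded from Zookeeper
>   }
> }
> {code}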
> The Session lifecycle is as follows: it is created in COMPUTING state by a
> Collection API command and is initialized with a snapshot of the cluster
> state from Zookeeper (this does not require a Zookeeper read, as the command
> runs on the Overseer, which maintains a cache of the cluster state). The
> command has exclusive access to the Session and can change its state. When
> the command is done changing the Session, the Session is “returned” and its
> state changes to EXECUTING while the command continues to run to persist the
> state to Zookeeper and interact with the nodes, but no longer interacts with
> the Session. Another command can then grab the Session in EXECUTING state,
> change its state back to COMPUTING and compute new changes on top of the
> previous ones. When all commands that used the Session have completed their
> work, the Session is “released” and destroyed (at this stage, Zookeeper
> contains all the state changes that were computed using that Session).
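> As a rough illustration of that lifecycle (hypothetical names, not the real
> Solr classes), the state machine looks roughly like this:
> {code:java}
> // COMPUTING: one command has exclusive access and is adding placement decisions.
> // EXECUTING: decisions are being persisted; another command may grab the Session.
> enum SessionState { COMPUTING, EXECUTING }
>
> final class SessionLifecycleSketch {
>   private SessionState state = SessionState.COMPUTING; // created by the first command
>   private int activeUsers = 1;                          // commands still persisting changes
>
>   /** The current command finished computing; others may now reuse the Session. */
>   synchronized void returnSession() {
>     state = SessionState.EXECUTING;
>   }
>
>   /** Another command grabs the EXECUTING Session and stacks new changes on it. */
>   synchronized void reuse() {
>     if (state != SessionState.EXECUTING) {
>       throw new IllegalStateException("Session is busy (COMPUTING)");
>     }
>     state = SessionState.COMPUTING;
>     activeUsers++;
>   }
>
>   /** A command finished persisting to Zookeeper; the last one destroys the Session. */
>   synchronized boolean release() {
>     return --activeUsers == 0;    // true: Session can be discarded
>   }
> }
> {code}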
> The issue arises when multiple Collection API commands are executed at once.
> A first Session is created and commands start using it one by one. In a
> simple 1 shard 1 replica collection creation test run with 100 parallel
> Collection API requests (see debug logs from PolicyHelper in file
> policy.logs), this Session update phase (Session in COMPUTING status in
> SessionWrapper) takes about 250-300ms (MacBook Pro).
> This means that about 40 commands can run by reusing the same Session in
> turn (45 in the sample run). The commands that have been waiting too long
> time out after 10 seconds, more or less all at the same time (at the rate at
> which they were received by the OverseerCollectionMessageHandler, approx.
> one per 100ms in the sample run), and most/all independently decide to
> create a new Session. These new Sessions are based on the Zookeeper state;
> they may or may not include some of the changes from the first 40 commands
> (depending on whether those commands got their changes written to Zookeeper
> by the time of the 10-second timeout; a few might have made it, see below).
> These new Sessions (54 sessions in addition to the initial one) are based on
> more or less the same state, so all remaining commands make placement
> decisions that do not take each other into account (and likely not much of
> the first 44 placement decisions either).
> In the sample run whose relevant logs are attached, the 100 single-shard,
> single-replica collection creations resulted in 82 collections on the
> Overseer node, and 5 and 13 collections on the two other nodes of a 3-node
> cluster. Given that the initial Session was used 45 times (once initially,
> then reused 44 times), one would have expected at least the first 45
> collections to be evenly distributed, i.e. 15 replicas on each node. This
> was not the case, possibly a sign of other issues (other runs even ended up
> placing 0 of the 100 replicas on one of the nodes).
> From the client perspective, HTTP admin collection CREATE requests averaged
> 19.5 seconds each and lasted between 7 and 28 seconds (100 parallel
> threads). This is likely an indication that the last 55 collection creations
> didn’t see much of the state updates done by the first 45 creations (the
> client-observed delay is longer than the actual Overseer command execution
> time by the HTTP time plus the Collections API Zookeeper queue time).
> *A possible fix* is to not wait at all before creating a new Session when
> the currently cached Session is busy (i.e. COMPUTING). This will be somewhat
> less optimal in low-load cases (likely not an issue: future creations will
> compensate for slight imbalance and sub-optimal placement), but it will
> speed up Collection API calls (no waiting) and will prevent multiple waiting
> commands from all creating new Sessions based on an identical Zookeeper
> state in cases such as the one described here. For long autoscaling
> computations (minutes or more) it will likely not make a big difference.
> If more than a single Session were cached (and reused), fewer ongoing
> updates would be lost.
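> A very rough sketch of that multi-Session idea (illustrative only, not the
> actual PolicyHelper changes):
> {code:java}
> import java.util.ArrayDeque;
> import java.util.Deque;
>
> final class MultiSessionCacheSketch {
>   private static final int MAX_CACHED = 10;                      // hypothetical cap
>   private final Deque<Object> idleSessions = new ArrayDeque<>(); // returned, reusable
>
>   /** Prefer an idle cached Session (it carries pending changes); never wait. */
>   synchronized Object acquire() {
>     Object s = idleSessions.pollFirst();
>     return (s != null) ? s : createFromZkState();
>   }
>
>   /** A command finished computing: cache its Session so others can stack changes on it. */
>   synchronized void returnSession(Object session) {
>     if (idleSessions.size() < MAX_CACHED) {
>       idleSessions.addFirst(session);
>     }
>     // Otherwise drop it; its changes become visible once persisted to Zookeeper.
>   }
>
>   private Object createFromZkState() {
>     return new Object();  // placeholder for a Session seeded from the persisted state
>   }
> }
> {code}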
> Maybe, rather than caching the new, updated cluster state after each change,
> the changes themselves (the deltas) should be tracked. This might make it
> possible to propagate changes between Sessions, or to reconcile the cluster
> state read from Zookeeper with the stream of changes stored in a Session, by
> identifying which deltas made it to Zookeeper, which ones are new in
> Zookeeper (originating from an update in another Session) and which are
> still pending.
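> Sketched very roughly (hypothetical types, nothing like this exists in the
> code today), such delta tracking could look like:
> {code:java}
> import java.util.ArrayList;
> import java.util.List;
> import java.util.Set;
>
> final class DeltaTrackingSketch {
>   /** One placement decision not yet known to be persisted to Zookeeper. */
>   record ReplicaDelta(String collection, String shard, String targetNode) {}
>
>   private final List<ReplicaDelta> pending = new ArrayList<>();
>
>   void record(ReplicaDelta delta) {
>     pending.add(delta);
>   }
>
>   /**
>    * Reconcile with a freshly read Zookeeper state: deltas already visible
>    * there were persisted (possibly by another Session) and can be dropped;
>    * the rest are still pending and must be replayed on the new snapshot.
>    */
>   List<ReplicaDelta> stillPending(Set<ReplicaDelta> visibleInZk) {
>     List<ReplicaDelta> remaining = new ArrayList<>();
>     for (ReplicaDelta d : pending) {
>       if (!visibleInZk.contains(d)) {
>         remaining.add(d);
>       }
>     }
>     return remaining;
>   }
> }
> {code}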