Ilan Ginzburg created SOLR-14462:
------------------------------------

             Summary: Autoscaling placement wrong with concurrent collection 
creations
                 Key: SOLR-14462
                 URL: https://issues.apache.org/jira/browse/SOLR-14462
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: AutoScaling
    Affects Versions: master (9.0)
            Reporter: Ilan Ginzburg
         Attachments: policylogs.txt

Under concurrent collection creation, wrong Autoscaling placement decisions can 
lead to severely unbalanced clusters.
 Sequential creation of the same collections is handled correctly and the 
cluster is balanced.

*TL;DR;* under high load, the way Sessions that cache not-yet-persisted changes
to Zookeeper are managed causes the placement decisions of multiple concurrent
Collection API calls to ignore each other: they end up based on an identical
“initial” cluster state, possibly leading to identical placement decisions and,
as a consequence, cluster imbalance.

*Some context first* for those less familiar with how Autoscaling deals with
cluster state change: a PolicyHelper.Session is created with a snapshot of the
Zookeeper cluster state and is used to track cluster state changes that have
already been decided but not yet persisted to Zookeeper, so that Collection API
commands can make the right placement decisions.
 A Collection API command either uses an existing cached Session (which includes
the changes computed by previous commands) or creates a new Session initialized
from the Zookeeper cluster state (i.e. with only the state changes already
persisted).
 When a Collection API command requires a Session (one is needed for any cluster
state update computation) and the cached Session exists but is currently in use,
the command can wait up to 10 seconds. If the Session becomes available within
that time it is reused, otherwise a new one is created.
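
For illustration only, here is a minimal sketch of that reuse-or-wait-or-create
behavior. All names (SessionCache, CachedSession, SESSION_WAIT_MS) are made up
for the example; this is not the actual PolicyHelper/SessionWrapper code.

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class SessionCache {
  private static final long SESSION_WAIT_MS = 10_000; // the 10 second wait described above

  private final ReentrantLock lock = new ReentrantLock();
  private final Condition released = lock.newCondition();
  private CachedSession cached; // at most one Session is cached at a time
  private boolean inUse;        // true while a command is COMPUTING with it

  CachedSession acquire() throws InterruptedException {
    lock.lock();
    try {
      long remaining = TimeUnit.MILLISECONDS.toNanos(SESSION_WAIT_MS);
      // Wait up to 10 seconds for the cached Session to be returned by its current user.
      while (cached != null && inUse && remaining > 0) {
        remaining = released.awaitNanos(remaining);
      }
      if (cached == null || inUse) {
        // No Session yet, or timed out while it was busy: build a fresh one from
        // the persisted Zookeeper cluster state.
        cached = CachedSession.fromZkClusterState();
      }
      inUse = true;
      return cached; // when reused, it carries the changes of previous commands
    } finally {
      lock.unlock();
    }
  }

  void returnSession() {
    lock.lock();
    try {
      inUse = false; // Session moves to EXECUTING, another command may grab it
      released.signalAll();
    } finally {
      lock.unlock();
    }
  }
}

class CachedSession {
  static CachedSession fromZkClusterState() {
    // In Solr this would snapshot the Overseer's cached view of the cluster state.
    return new CachedSession();
  }
}
{code}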

The Session lifecycle is as follows: it is created in the COMPUTING state by a
Collection API command and is initialized with a snapshot of the cluster state
from Zookeeper (this does not require a Zookeeper read, since it runs on the
Overseer, which maintains a cache of the cluster state). The command has
exclusive access to the Session and can change its state. When the command is
done changing the Session, the Session is “returned” and its state changes to
EXECUTING; the command continues to run to persist the state to Zookeeper and
interact with the nodes, but no longer interacts with the Session. Another
command can then grab the Session in the EXECUTING state and change its state
back to COMPUTING to compute new changes that take the previous changes into
account. When all commands that have used the Session have completed their
work, the Session is “released” and destroyed (at this stage, Zookeeper
contains all the state changes that were computed using that Session).
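
Purely as an illustration (again a hypothetical class, not Solr code), the
lifecycle above can be seen as a tiny state machine:

{code:java}
class SessionLifecycle {
  enum State { COMPUTING, EXECUTING, RELEASED }

  private State state = State.COMPUTING; // created in COMPUTING by the first command
  private int activeCommands = 1;        // commands that used the Session and are still running

  // The command is done changing the Session: it keeps running (persisting to
  // Zookeeper, talking to nodes) but no longer touches the Session.
  synchronized void returned() {
    state = State.EXECUTING;
  }

  // Another command grabs the EXECUTING Session to compute new changes on top
  // of the ones already recorded in it.
  synchronized void grab() {
    if (state != State.EXECUTING) {
      throw new IllegalStateException("Session is " + state);
    }
    state = State.COMPUTING;
    activeCommands++;
  }

  // Each command releases the Session once its work (including the Zookeeper
  // writes) is done; when the last one does, the Session is destroyed.
  synchronized void release() {
    if (--activeCommands == 0) {
      state = State.RELEASED;
    }
  }
}
{code}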

The issue arises when multiple Collection API commands are executed at once. A
first Session is created and commands start using it one by one. In a simple
test creating 1 shard 1 replica collections with 100 parallel Collection API
requests (see the PolicyHelper debug logs in the attached policylogs.txt), this
Session update phase (Session in COMPUTING status in SessionWrapper) takes
about 250-300ms per command (on a MacBook Pro).

This means that roughly 40 commands can run in turn on the same Session before
the 10 second wait expires (10,000ms / ~250ms per COMPUTING phase ≈ 40; it was
45 in the sample run). The commands that have been waiting too long time out
after 10 seconds, more or less all at the same time (at the rate at which they
were received by the OverseerCollectionMessageHandler, approximately one per
100ms in the sample run), and most or all of them independently decide to
create a new Session. These new Sessions are based on the Zookeeper state; they
may or may not include some of the changes from the first ~40 commands
(depending on whether those commands got their changes written to Zookeeper by
the time of the 10 second timeout; a few may have made it, see below).

These new Sessions (54 of them in addition to the initial one) are based on
more or less the same state, so all remaining commands make placement decisions
that do not take each other into account (and likely not much of the first 44
placement decisions either).

In the sample run whose relevant logs are attached, the 100 single shard single
replica collection creations resulted in 82 collections on the Overseer node,
and 5 and 13 collections on the two other nodes of a 3 node cluster. Given that
the initial Session was used 45 times (once initially, then reused 44 times),
one would have expected at least the first 45 collections to be evenly
distributed, i.e. 15 replicas on each node. This was not the case, possibly a
sign of other issues (other runs even ended up placing 0 of the 100 replicas on
one of the nodes).

From the client perspective, HTTP admin collection CREATE requests averaged
19.5 seconds each and lasted between 7 and 28 seconds (100 parallel threads).
This is likely an indication that the last 55 collection creations didn’t see
much of the state updates done by the first 45 creations (the client-observed
delay is longer than the actual Overseer command execution time by the HTTP
time plus the Collection API Zookeeper queue time).

*A possible fix* is to not wait at all before creating a new Session when the
currently cached Session is busy (i.e. COMPUTING). This is somewhat less
optimal in low load cases (likely not an issue: future creations will
compensate for the slight imbalance and suboptimal placement), but it speeds up
Collection API calls (no waiting) and prevents multiple waiting commands from
all creating new Sessions based on an identical Zookeeper state in cases such
as the one described here. For long (minutes or more) autoscaling computations
it will likely not make a big difference.
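
In terms of the hypothetical SessionCache sketch earlier in this description,
the change would look roughly like this (a sketch only, not a patch):

{code:java}
  CachedSession acquireWithoutWaiting() {
    lock.lock();
    try {
      if (cached == null || !inUse) {
        // No cached Session, or it is idle (EXECUTING): create or reuse it as before.
        if (cached == null) {
          cached = CachedSession.fromZkClusterState();
        }
        inUse = true;
        return cached;
      }
    } finally {
      lock.unlock();
    }
    // Cached Session is busy (COMPUTING): do not wait the 10 seconds, immediately
    // compute against a fresh Session built from the persisted Zookeeper state.
    return CachedSession.fromZkClusterState();
  }
{code}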

If more than a single Session were cached (and reused), fewer ongoing updates
would be lost.
 Maybe, rather than caching the new updated cluster state after each change, the
changes themselves (the deltas) should be tracked. This might make it possible
to propagate changes between Sessions, or to reconcile the cluster state read
from Zookeeper with the stream of changes stored in a Session, by identifying
which deltas made it to Zookeeper, which ones are new from Zookeeper
(originating from an update in another Session) and which are still pending.
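
A rough illustration of the delta tracking idea (all names are hypothetical,
nothing like this exists in Solr today):

{code:java}
import java.util.ArrayList;
import java.util.List;

class SessionDeltas {
  // One pending placement decision: a replica of a collection/shard placed on a node.
  record PlacementDelta(String collection, String shard, String node) {}

  // Minimal view of a freshly read Zookeeper state, for reconciliation.
  interface ZkSnapshot {
    boolean containsReplica(String collection, String shard, String node);
  }

  private final List<PlacementDelta> pending = new ArrayList<>();

  void record(PlacementDelta delta) {
    pending.add(delta);
  }

  // Deltas already visible in Zookeeper have been persisted and can be dropped;
  // the rest are still pending. Replicas present in Zookeeper but not among our
  // deltas originated from updates made through other Sessions.
  List<PlacementDelta> stillPending(ZkSnapshot zk) {
    List<PlacementDelta> notYetPersisted = new ArrayList<>();
    for (PlacementDelta delta : pending) {
      if (!zk.containsReplica(delta.collection(), delta.shard(), delta.node())) {
        notYetPersisted.add(delta);
      }
    }
    return notYetPersisted;
  }
}
{code}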


