Denis Chudov created IGNITE-28420:
-------------------------------------
Summary: Leaseholder balancing: Timeout-based replica switch
Key: IGNITE-28420
URL: https://issues.apache.org/jira/browse/IGNITE-28420
Project: Ignite
Issue Type: Improvement
Components: placement driver ai3
Reporter: Denis Chudov
This improves on the ungraceful solution by letting the user trade transaction
failures for a latency spike.
The algorithm is:
# The user invokes partitions rebalance-primaries --wait-lease
--extra-wait-time 30sec.
# Placement driver identifies all partitions P that need to be rebalanced from
their current replica leases L.
# Placement driver marks L.isCondemned = true, then sleeps for extraWaitTime.
# isCondemned signals to txns that the current lease will soon expire. On txn
coordinators, awaitPrimaryReplica treats isCondemned == true as if the lease
didn't exist, i.e. it waits for the next lease to start.
# After extraWaitTime passes, the placement driver tells the node
L.leaseholderId to give up the lease at the end of the current term, and not to
attempt to be elected in the next term.
# On the next election, a new primary is elected and the lease is saved
normally, with isCondemned = false. Awaiting txns all proceed.
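The coordinator-side rule and the lease hand-over in the steps above can be
sketched as follows. This is only an illustration, not the placement driver
implementation: the Lease class and the usable_lease, condemn and elect_next
helpers are hypothetical names, and times are plain seconds.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Lease:
    # Hypothetical, simplified lease record; not the real Ignite class.
    leaseholder_id: str
    start: float          # seconds, inclusive
    expiration: float     # seconds, exclusive
    is_condemned: bool = False


def usable_lease(lease: Optional[Lease], now: float) -> Optional[Lease]:
    """Return the lease a txn coordinator may use at `now`, or None if it must wait.

    A condemned lease is treated exactly as if it did not exist.
    """
    if lease is None or lease.is_condemned:
        return None
    if not (lease.start <= now < lease.expiration):
        return None
    return lease


def condemn(lease: Lease) -> Lease:
    """Placement driver marks the lease as condemned (step 3)."""
    return replace(lease, is_condemned=True)


def elect_next(old: Lease, new_leaseholder_id: str, interval: float) -> Lease:
    """The next lease starts when the old term ends and is not condemned (step 6)."""
    return Lease(new_leaseholder_id, old.expiration, old.expiration + interval)


current = Lease("node-A", start=0.0, expiration=10.0)
condemned = condemn(current)
successor = elect_next(condemned, "node-B", interval=10.0)
```

In this sketch a new txn arriving at t=5 gets no usable lease from the
condemned record and has to wait until the successor lease starts at t=10.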
The trade-off of this solution vs ungraceful is that we get a latency spike on
new txns but potentially avoid txn failures:
* New txns see latency spikes of up to extraWaitTime + leaseExpirationInterval.
* Existing txns that don't finish before extraWaitTime +
leaseExpirationInterval still fail.
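The two bounds above are simple sums. A minimal sketch of the arithmetic,
using hypothetical function names; the 5 second leaseExpirationInterval is an
assumed value chosen only for illustration, while 30 seconds is the
--extra-wait-time from the example command:

```python
def max_new_txn_delay(extra_wait_time: float, lease_expiration_interval: float) -> float:
    """Upper bound (seconds) on how long a new txn may wait for the next lease."""
    return extra_wait_time + lease_expiration_interval


def existing_txn_fails(remaining_txn_time: float,
                       extra_wait_time: float,
                       lease_expiration_interval: float) -> bool:
    """True if an in-flight txn cannot finish before the bound stated above."""
    return remaining_txn_time >= extra_wait_time + lease_expiration_interval


EXTRA_WAIT_TIME = 30.0            # from --extra-wait-time 30sec
LEASE_EXPIRATION_INTERVAL = 5.0   # assumed value, for illustration only

worst_case_delay = max_new_txn_delay(EXTRA_WAIT_TIME, LEASE_EXPIRATION_INTERVAL)
long_txn_fails = existing_txn_fails(40.0, EXTRA_WAIT_TIME, LEASE_EXPIRATION_INTERVAL)
short_txn_fails = existing_txn_fails(10.0, EXTRA_WAIT_TIME, LEASE_EXPIRATION_INTERVAL)
```

With these assumed numbers, new txns may wait up to 35 seconds, a txn that
still needs 40 seconds fails, and a txn that needs 10 more seconds does not.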
The advantage here is that the user is in control: they know their system and
can decide whether they can handle failed txns or latency spikes, how long
their txns take, etc. If all their txns are shorter than
leaseExpirationInterval (very common), then rebalance-primaries --wait-lease
gives no failures and a latency spike of at most leaseExpirationInterval, which
is already possible in other failure scenarios. Note also that if the primary
is currently overloaded, the user is already risking or even experiencing
latency spikes from the overload.
Note that there may be a more clever management of the wait time than proposed,
especially for cases when all txns are sub-second: in that case we may need to
condemn the lease only for its last fraction.