[
https://issues.apache.org/jira/browse/RATIS-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18071915#comment-18071915
]
Ivan Andika commented on RATIS-2500:
------------------------------------
This seems to be triggered by the changes in RATIS-2487. We fixed the NPE, but
another edge case can cause an infinite installSnapshot loop in which the
leader keeps sending the same InstallSnapshot request and the follower keeps
returning ALREADY_INSTALLED without doing anything.
Below is a generalized description of the edge case. I'm not sure whether
comparing the snapshot term with the term of the leader's first available log
entry is actually the correct fix.
{code}
Generalized Infinite Snapshot Loop
Variables
┌──────────┬─────────────────────────────────────────────────────────────┬──────────────────┐
│ Variable │ Description                                                 │ Example from log │
├──────────┼─────────────────────────────────────────────────────────────┼──────────────────┤
│ C        │ Index of the initial CONFIGURATIONENTRY (always 0 in Ratis) │ 0                │
│ t1       │ Term of the old leader (omNode-1)                           │ 1                │
│ t2       │ Term of the new leader (omNode-3), where t2 > t1            │ 2                │
│ S_f      │ Follower's snapshot index at restart (taken during t1)      │ 0                │
│ S_l      │ Leader's latest snapshot index, where S_l > C               │ 150              │
│ N        │ Leader's next index (end of log), where N >> S_l            │ 179              │
└──────────┴─────────────────────────────────────────────────────────────┴──────────────────┘
Preconditions
1. The old leader wrote a CONFIGURATIONENTRY at index C in term t1.
2. All nodes replicated it. The follower took a snapshot covering index C (so
S_f = C, snapshot term = t1).
3. The old leader shuts down. A new leader is elected at term t2 > t1.
4. The new leader rolls the log segment at index C into a closed segment, and
opens a new segment starting at C+1. All new entries from C+1 onwards are
written in term t2.
5. The leader takes snapshots and purges. The closed segment log_C-C is
purged. The open segment starting at C+1 is never purged (purge only operates
on closed segments).
6. The follower restarts and rejoins.
Leader's log state when follower rejoins
closedSegments = [] (log_C-C was purged)
openSegment = log_inprogress_{C+1} (indices C+1 .. N-1, all term t2)
getStartIndex() = C+1 (from openSegment, since closedSegments is empty)
getNextIndex() = N
getTermIndex(C) = null (purged)
getTermIndex(C+1)= (t2, C+1) (exists in open segment)
snapshotIndex = S_l
Follower's state after restart
snapshotIndex = S_f = C (from term t1)
snapshotTerm = t1
nextIndex = C+1 (set by leader after INCONSISTENCY reply)
The Loop (leader LogAppender run() loop)

① shouldSendAppendEntries()?
     followerNextIndex(C+1) < leaderNextIndex(N) → true

② shouldInstallSnapshot(true)?
     followerNextIndex(C+1) == leaderStartIndex(C+1) ✓
     followerNextIndex(C+1) > LEAST_VALID_LOG_INDEX(0) ✓
     getPrevious(C+1):
         previousIndex = C
         getTermIndex(C) → null (index C was purged)
         snapshot.getTermIndex().getIndex() = S_l ≠ C
         → null
     → true

③ shouldNotifyToInstallSnapshot()
     getTermIndex(startIndex = C+1) → (t2, C+1) (exists in open segment)
     → return firstAvailable = (t2, C+1)

④ notifyInstallSnapshot(firstAvailable = (t2, C+1))
     → sends request to follower, blocks on waitForResponse()

⑤ Follower receives notification:
     firstAvailableLogIndex = C+1
     snapshotIndex = S_f = C
     OLD CHECK (index-only):
         S_f ≠ INVALID → C ≠ -1         ✓
         S_f + 1 ≥ C+1 → C + 1 ≥ C + 1  ✓
         C+1 > INVALID → C + 1 > -1     ✓
     → ALREADY_INSTALLED (snapshotIndex = S_f = C) ◄── BUG
     (ignores that snapshotTerm t1 < firstAvailableTerm t2)

⑥ Leader handles ALREADY_INSTALLED:
     increaseNextIndex(S_f) → newNextIndex = S_f + 1 = C + 1
     updateToMax(C+1) → NO-OP (nextIndex is already C+1)

⑦ followerNextIndex = C+1 (unchanged) → LOOP BACK TO ①
Why it's an infinite loop
The loop is self-sustaining because the invariant never changes:
followerNextIndex = C + 1 (set once, never advances)
leaderStartIndex = C + 1 (open segment, never purged)
getTermIndex(C) = null (closed segment was purged, permanent)
S_f = C (follower never installs a new snapshot)
Every iteration:
• shouldInstallSnapshot → true (because getPrevious(C+1) is always null)
• firstAvailable → (t2, C+1) (always the same, from the open segment)
• Follower replies ALREADY_INSTALLED with snapshotIndex = C (the index-only
  check always passes when S_f = C and firstAvailableLogIndex = C+1, since
  C + 1 >= C + 1)
• Leader sets nextIndex = C + 1 (a no-op)
The fix
The fix adds a term comparison to the follower's isSnapshotAlreadyInstalled
check:
OLD: S_f + 1 >= C+1 AND S_f ≠ INVALID AND C+1 > INVALID
→ true (loop continues forever)
NEW: S_f + 1 >= C+1 AND S_f ≠ INVALID AND C+1 > INVALID AND
snapshotTerm >= firstAvailableTerm
→ t1 >= t2 → false (loop breaks, follower proceeds to install snapshot)
The additional condition snapshotTerm(t1) >= firstAvailableTerm(t2) fails
because t1 < t2, so the follower correctly recognizes its snapshot is stale and
proceeds with the actual snapshot installation, which
advances followerNextIndex past C+1 and breaks the loop.
{code}
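The old and new follower-side checks can be sketched as a standalone snippet. The method names here are illustrative only, not the actual Ratis code, and INVALID stands in for RaftLog.INVALID_LOG_INDEX:
{code:java}
// Illustrative sketch only; method names are hypothetical, not the Ratis API.
public class AlreadyInstalledCheck {
  static final long INVALID = -1; // stand-in for RaftLog.INVALID_LOG_INDEX

  // OLD check: compares indices only.
  static boolean alreadyInstalledOld(long snapshotIndex, long firstAvailableLogIndex) {
    return snapshotIndex != INVALID
        && snapshotIndex + 1 >= firstAvailableLogIndex
        && firstAvailableLogIndex > INVALID;
  }

  // NEW check: additionally requires the follower's snapshot term to be
  // at least the term of the leader's first available entry.
  static boolean alreadyInstalledNew(long snapshotIndex, long snapshotTerm,
      long firstAvailableLogIndex, long firstAvailableTerm) {
    return alreadyInstalledOld(snapshotIndex, firstAvailableLogIndex)
        && snapshotTerm >= firstAvailableTerm;
  }

  public static void main(String[] args) {
    // Example values from the table: S_f = C = 0, t1 = 1,
    // firstAvailable = (t2, C+1) = (2, 1).
    System.out.println(alreadyInstalledOld(0, 1));       // true: follower skips the install, loop repeats
    System.out.println(alreadyInstalledNew(0, 1, 1, 2)); // false: t1 < t2, follower installs the snapshot
  }
}
{code}
With the example values, the old check accepts while the new check rejects, which is exactly the loop-breaking behavior described above.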
> Infinite loop in InstallSnapshot notification mechanism
> -------------------------------------------------------
>
> Key: RATIS-2500
> URL: https://issues.apache.org/jira/browse/RATIS-2500
> Project: Ratis
> Issue Type: Bug
> Reporter: Ivan Andika
> Assignee: Ivan Andika
> Priority: Major
> Attachments: infinite-snapshot.log
>
>
> Found in a test where the cluster is stuck in an infinite snapshot loop: the
> leader keeps sending install snapshot notifications, and the follower keeps
> replying ALREADY_INSTALLED. Found a possible cause with the help of an LLM.
> The timeout test log [^infinite-snapshot.log]
> Problematic code in
> SnapshotInstallationHandler#notifyStateMachineToInstallSnapshot
> {code:java}
> if (snapshotIndex != INVALID_LOG_INDEX && snapshotIndex + 1 >= firstAvailableLogIndex
>     && firstAvailableLogIndex > INVALID_LOG_INDEX) {
>   // State Machine has already installed the snapshot. Return the
>   // latest snapshot index to the Leader.
>   inProgressInstallSnapshotIndex.compareAndSet(firstAvailableLogIndex, INVALID_LOG_INDEX);
>   LOG.info("{}: InstallSnapshot notification result: {}, current snapshot index: {}",
>       getMemberId(), InstallSnapshotResult.ALREADY_INSTALLED, snapshotIndex);
>   final InstallSnapshotReplyProto reply = toInstallSnapshotReplyProto(leaderId,
>       getMemberId(), currentTerm, InstallSnapshotResult.ALREADY_INSTALLED, snapshotIndex);
>   return future.thenApply(dummy -> reply);
> } {code}
> The flow
> # Initial state: omNode-1 is leader, processing transactions.
> SNAPSHOT_THRESHOLD = 50 and LOG_PURGE_GAP = 50.
> # `omNode-1` shuts down: Before shutdown, takeSnapshot() is called. The
> snapshotInfo is updated with the current lastAppliedTermIndex. The
> TransactionInfo is written to RocksDB.
> # `omNode-3` becomes leader: The cluster continues processing transactions.
> After 50 more transactions, omNode-3 takes a snapshot and purges old log
> entries. The leader's firstAvailableLogIndex advances.
> # `omNode-1` restarts: During initialization:
> • loadSnapshotInfoFromDB() reads TransactionInfo from RocksDB and sets
> snapshotInfo to the last applied index before shutdown (e.g., index 0 or some
> small value from term 1).
> • getLatestSnapshot() returns this snapshotInfo.
> • The RaftLogBase constructor calls getSnapshotIndexFromStateMachine
> which returns snapshotInfo.getIndex() (e.g., 0).
> • Both commitIndex and snapshotIndex in the Raft log are initialized to
> this value (0).
> # Leader sends install snapshot notification: omNode-3 detects omNode-1 is
> behind and sends installSnapshot with firstAvailableLogIndex = 1 (the first
> available log entry in the leader's log after purging).
> # In notifyStateMachineToInstallSnapshot, the if condition returns true:
> snapshotIndex (0) != INVALID_LOG_INDEX (-1) → true; snapshotIndex + 1 (1)
> >= firstAvailableLogIndex (1) → true; firstAvailableLogIndex (1) >
> INVALID_LOG_INDEX (-1) → true.
> # Follower returns ALREADY_INSTALLED with snapshotIndex = 0. This is
> incorrect because omNode-1's snapshot at index 0 is from term 1, while the
> leader is in term 2 with entries starting at index 1. The follower has NOT
> actually applied the entries that the leader has; it just happens that its
> local snapshot index (0) satisfies the arithmetic condition 0 + 1 >= 1.
> The Core Problem
> The ALREADY_INSTALLED check only compares indices without considering terms.
> The follower's snapshotIndex of 0 (from term 1) is treated as equivalent to
> the leader's firstAvailableLogIndex of 1 (from term 2), even though they
> represent completely different states. The condition snapshotIndex + 1 >=
> firstAvailableLogIndex is a heuristic that assumes that if the follower's
> snapshot is at or past the leader's first available log entry, the follower
> doesn't need the snapshot. But this breaks when:
> * The follower's snapshot is from a previous term
> * The leader's log has been purged to a very low index (like 1) in a new term
> * The follower missed all entries in the new term
> One possible fix is to also take the snapshot term into account when
> comparing snapshots, instead of only the snapshot index.
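> To see why ALREADY_INSTALLED cannot advance the leader here, a minimal sketch
> of the nextIndex update; updateNextIndexToMax is a hypothetical stand-in for
> the leader's updateToMax-style FollowerInfo update, which only ever moves
> nextIndex forward:
> {code:java}
> import java.util.concurrent.atomic.AtomicLong;
>
> public class LeaderNoOpDemo {
>   // Hypothetical stand-in for the leader's per-follower nextIndex update:
>   // like updateToMax, it only moves nextIndex forward, never backward.
>   static boolean updateNextIndexToMax(AtomicLong nextIndex, long newNextIndex) {
>     long old = nextIndex.getAndAccumulate(newNextIndex, Math::max);
>     return newNextIndex > old; // false means the update was a no-op
>   }
>
>   public static void main(String[] args) {
>     long followerSnapshotIndex = 0;           // reported via ALREADY_INSTALLED
>     AtomicLong nextIndex = new AtomicLong(1); // already 1 after the INCONSISTENCY reply
>     // Leader handling: increaseNextIndex(snapshotIndex) -> newNextIndex = 0 + 1 = 1
>     boolean advanced = updateNextIndexToMax(nextIndex, followerSnapshotIndex + 1);
>     System.out.println(advanced);             // false: nextIndex stays at 1, loop repeats
>   }
> }
> {code}
> Since the reported snapshotIndex + 1 equals the nextIndex already set after
> the INCONSISTENCY reply, the update is a no-op and the leader sends the same
> notification again.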
--
This message was sent by Atlassian Jira
(v8.20.10#820010)