[
https://issues.apache.org/jira/browse/RATIS-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18071329#comment-18071329
]
Ivan Andika commented on RATIS-2498:
------------------------------------
One possible cause, found while analyzing testAppendEntriesTimeout: a
pre-existing race between leader step-down and the client's sliding window.
Here's the timeline:
1. s2 becomes leader (term 1). The test blocks WRITE_STATE_MACHINE_DATA on
both followers (s0 and s1): the filter at line 436 keeps every server whose
ID != leader.
2. Client sends a dummy Watch (seq=1); sendDummyRequest is enabled by default
(line 123 of OrderedAsync). This goes to s0, gets NotLeaderException, is
retried to s2 (the leader), and completes there. s2's sliding window
processes seq=1. s0's sliding window never sees seq=1 (s0 rejected the
request before it reached the sliding window).
3. Client sends the write "abc" (seq=2) to s2. But followers can't respond to
AppendEntries because writeStateMachineData is blocked, which prevents them
from acknowledging the log entry.
4. s2 loses leadership after 544ms (LOST_MAJORITY_HEARTBEATS). The election
timeout is 300ms, and both followers are blocked. s2 can't be re-elected for
10s (lostMajorityHeartbeatsRecently).
5. At +5s, the test unblocks `WRITE_STATE_MACHINE_DATA`. At this moment
there's no leader, so the unblock filter actually works correctly (both s0
and s1 are unblocked).
6. s0 becomes leader (term 2). The "abc" entry from term 1 gets committed.
The cluster is healthy (all at commit index 3).
7. Client retries seq=2 to s0 (the new leader). But s0's server-side sliding
window has never processed seq=1 for this client (the dummy Watch was only
completed on s2). The sliding window queues seq=2 and waits for seq=1, which
never arrives. The request hangs until the Netty RPC timeout (3s) fires, is
retried, and the cycle repeats forever.
In short: the dummy Watch (seq=1) was processed on the old leader (s2), but
the new leader (s0) never saw it. The server-side sliding window on s0 blocks
seq=2 waiting for a seq=1 that will never come.
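
The stall in step 7 can be reproduced with a toy model. This is only a
sketch under stated assumptions: ServerWindow below is a hypothetical class,
not the Ratis SlidingWindow implementation, and it assumes each server
delivers a client's requests strictly in seq order starting from 1. It
deliberately ignores any mechanism the real code may have for re-establishing
the first sequence number after a leader change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the stall; NOT the actual Ratis sliding window code. */
public class SlidingWindowSketch {
  /** Hypothetical per-client window kept by one server. */
  static class ServerWindow {
    private final Map<Long, String> pending = new TreeMap<>();
    private final List<String> processed = new ArrayList<>();
    private long nextToProcess = 1; // assumption: client seq numbers start at 1

    /** Queue the request, then process everything that is now in order. */
    void receive(long seq, String request) {
      if (seq < nextToProcess) {
        return; // already processed: a retry of an old request is ignored
      }
      pending.put(seq, request); // a retry of a queued seq just overwrites it
      String next;
      while ((next = pending.remove(nextToProcess)) != null) {
        processed.add(next);
        nextToProcess++;
      }
    }

    List<String> processed() { return processed; }
    int queued() { return pending.size(); }
  }

  public static void main(String[] args) {
    ServerWindow s2 = new ServerWindow(); // leader in term 1
    ServerWindow s0 = new ServerWindow(); // leader in term 2

    s2.receive(1, "dummy Watch"); // step 2: completes on s2 only
    s2.receive(2, "write abc");   // step 3

    s0.receive(2, "write abc");   // step 7: retry goes to the new leader
    s0.receive(2, "write abc");   // retry after the RPC timeout: same result

    System.out.println("s2 processed: " + s2.processed());
    // prints "s2 processed: [dummy Watch, write abc]"
    System.out.println("s0 processed: " + s0.processed() + ", queued: " + s0.queued());
    // prints "s0 processed: [], queued: 1"
  }
}
```

In this model s0 makes progress only if seq=1 is delivered to it as well,
which matches the hang described above: every retry of seq=2 lands in the
queue and nothing is ever released.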
This is a pre-existing race condition: whenever the leader steps down during
this test (which is likely given both followers are blocked and the election
timeout is only 300ms), the client gets stuck in this loop.
> Fix flaky TestRaftAsyncWithNetty
> --------------------------------
>
> Key: RATIS-2498
> URL: https://issues.apache.org/jira/browse/RATIS-2498
> Project: Ratis
> Issue Type: Bug
> Reporter: Ivan Andika
> Priority: Major
>
> TestRaftAsyncWithNetty has recently become flaky. The failures span multiple
> tests under RaftAsyncWithNetty.
>
> [https://github.com/apache/ratis/actions/runs/23957268626/job/69878546984]
> [https://github.com/apache/ratis/actions/runs/23956477384/job/69875932130]
> [https://github.com/apache/ratis/actions/runs/23739689985/job/69153331550]