[
https://issues.apache.org/jira/browse/KAFKA-16281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jack Vanlightly updated KAFKA-16281:
------------------------------------
Summary: Possible IllegalState with KIP-996 (was: Probable IllegalState
possible with KIP-966)
> Possible IllegalState with KIP-996
> ----------------------------------
>
> Key: KAFKA-16281
> URL: https://issues.apache.org/jira/browse/KAFKA-16281
> Project: Kafka
> Issue Type: Task
> Components: kraft
> Reporter: Jack Vanlightly
> Priority: Major
>
> I have a TLA+ model of KIP-996 and I have identified an IllegalState
> exception that would occur with the existing MaybeHandleCommonResponse
> behavior.
> The issue stems from the fact that a leader, let's call it r1, can resign
> (either due to a restart or check quorum) and then later initiate a pre-vote
> where it ends up in the same epoch as before, but with a cleared local leader id.
> When r1 transitions to Prospective it clears its local leader id. When r1
> receives a response from r2 who believes that r1 is still the leader, the
> logic in MaybeHandleCommonResponse tries to transition r1 to follower of
> itself, causing an IllegalState exception to be raised.
> This is an example history:
> # r1 is the leader in epoch 1.
> # r1 resigns due to check quorum, or restarts and resigns.
> # r1 experiences an election timeout and transitions to Prospective, clearing
> its local leader id.
> # r1 sends a pre-vote request to its peers.
> # r2 thinks r1 is still the leader, sends a vote response, not granting its
> vote and setting leaderId=r1 and epoch=1.
> # r1 receives the vote response and executes MaybeHandleCommonResponse, which
> tries to transition r1 to Follower of itself, raising an IllegalState exception.
> The relevant else if statement in MaybeHandleCommonResponse is here:
> https://github.com/apache/kafka/blob/a26a1d847f1884a519561e7a4fb4cd13e051c824/raft/src/main/java/org/apache/kafka/raft/KafkaRaftClient.java#L1538
> In the TLA+ specification, I fixed this issue by adding a fourth condition to
> this statement, that the leaderId also does not equal this server's id.
> [https://github.com/Vanlightly/kafka-tlaplus/blob/9b2600d1cd5c65930d666b12792d47362b64c015/kraft/kip_996/kraft_kip_996_functions.tla#L336]
> We should probably create a test to confirm the issue first and then look at
> using the fix I made in the TLA+, though there may be other options.
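
To make the history above concrete, here is a minimal sketch in Java. It is a toy model, not the real KafkaRaftClient: the class ReplicaModel, its methods, and the excludeSelf flag are all hypothetical names introduced for illustration. The three existing conditions (same epoch, response names a leader, no local leader known) are an assumption based on the description's wording that the fix adds a "fourth condition"; the linked source line is the authority on what the real branch checks.

```java
import java.util.OptionalInt;

// Hypothetical, simplified model of one replica; NOT the real KafkaRaftClient.
final class ReplicaModel {
    enum Role { LEADER, RESIGNED, PROSPECTIVE, FOLLOWER }

    final int localId;
    int epoch;
    Role role = Role.LEADER;
    OptionalInt leaderId;

    ReplicaModel(int localId, int epoch) {
        this.localId = localId;
        this.epoch = epoch;
        this.leaderId = OptionalInt.of(localId); // starts as leader of `epoch`
    }

    // Resign (check quorum or restart); the epoch does not change.
    void resign() {
        role = Role.RESIGNED;
    }

    // Election timeout: becoming Prospective clears the local leader id.
    void becomeProspective() {
        role = Role.PROSPECTIVE;
        leaderId = OptionalInt.empty();
    }

    // A replica can never be a follower of itself.
    void transitionToFollower(int newEpoch, int newLeaderId) {
        if (newLeaderId == localId) {
            throw new IllegalStateException(
                "Cannot become follower of self in epoch " + newEpoch);
        }
        epoch = newEpoch;
        role = Role.FOLLOWER;
        leaderId = OptionalInt.of(newLeaderId);
    }

    // Assumed shape of the relevant branch. With excludeSelf=true the
    // proposed fourth condition (leaderId != localId) is applied.
    // Returns true if the replica transitioned to Follower.
    boolean handleCommonResponse(int responseEpoch,
                                 OptionalInt responseLeaderId,
                                 boolean excludeSelf) {
        boolean shouldFollow = responseEpoch == epoch
            && responseLeaderId.isPresent()
            && !leaderId.isPresent()
            && (!excludeSelf || responseLeaderId.getAsInt() != localId);
        if (shouldFollow) {
            transitionToFollower(responseEpoch, responseLeaderId.getAsInt());
            return true;
        }
        return false;
    }
}
```

Replaying the history on this model (construct r1 with id 1 in epoch 1, call resign() and becomeProspective(), then handle a response carrying leaderId=1 and epoch=1) throws IllegalStateException when excludeSelf is false. With excludeSelf set to true, the branch is skipped and r1 stays Prospective. A unit test against the real client could follow the same sequence to confirm the issue before any fix is chosen.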