[
https://issues.apache.org/jira/browse/KAFKA-16710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
hudeqi updated KAFKA-16710:
---------------------------
Attachment: (was: 企业微信截图_e47e04cf-dc5d-49e6-b32d-ba2934c8a50a.png)
> Consecutive `makeFollower` calls may cause the replica fetcher thread to encounter
> an offset mismatch exception in `processPartitionData`
> --------------------------------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-16710
> URL: https://issues.apache.org/jira/browse/KAFKA-16710
> Project: Kafka
> Issue Type: Bug
> Components: core, replication
> Affects Versions: 2.8.1, 3.8.0
> Reporter: hudeqi
> Assignee: hudeqi
> Priority: Blocker
> Attachments: 企业微信截图_230257fe-1c11-4e77-93b3-b8b8edce2ba3.png,
> 企业微信截图_a5d3e50f-6982-43f7-9263-5e3c5b49cc1e.png
>
>
> This issue occurs during a partition reassignment:
> 110879, 110880 (original leader, original follower) ---> 110879, 110880,
> 110881, 113915 (the latter two replicas are the new leader and new follower) --->
> 110881, 113915 (new leader, new follower). The "Offset mismatch" exception
> occurs on the new follower 113915.
> Analysis shows that the exception arises from the following sequence during the reassignment:
> # After the new replicas 110881 and 113915 have fully joined the ISR, the
> controller switches the leader from 110879 to 110881 and then sends a new
> `leaderAndIsr` request (leader is 110881, ISR is 110879, 110880, 110881, 113915) to
> 110881 and 113915.
> # In response, 110881 executes `makeLeader` and 113915 executes
> `makeFollower`. After the new follower 113915 completes
> `removeFetcherForPartitions` and `addFetcherForPartitions`, it starts
> fetching data from the new leader 110881. Because the log end offset of
> the new leader 110881 (18735600055) is smaller than the log end offset of the
> new follower 113915 (18735600059), the new follower 113915 adds the partition
> to `divergingEndOffsets` during `processFetchRequest` and then calls
> `truncateOnFetchResponse` to truncate the local log to 18735600055.
> # However, `truncateOnFetchResponse` needs to acquire
> `partitionMapLock`. At the same time, the new leader 110881 and the
> new follower 113915 receive another `leaderAndIsr` request from the
> controller (to remove the old replicas 110879 and 110880 from the ISR). The
> `ReplicaFetcherManager` thread of the new follower 113915 executes the second
> `makeFollower`, acquires `partitionMapLock` first, and executes
> `removeFetcherForPartitions`. It then reads the local log end offset
> (18735600059) to use as the fetch offset and prepares to execute
> `addFetcherForPartitions` again, which will write that fetch offset
> (18735600059) to `partitionStates`.
> # Before that happens, the follower fetcher thread that was waiting to truncate
> the local log to 18735600055 obtains `partitionMapLock` first and
> completes the truncation, so the log end offset is now 18735600055.
> # The thread executing the second `makeFollower` then obtains
> `partitionMapLock` and executes `addFetcherForPartitions`, writing the
> now-stale fetch offset (18735600059) to `partitionStates`.
> # As a result, the follower fetcher thread throws the following exception
> during `processPartitionData`: "java.lang.IllegalStateException: Offset
> mismatch for partition aiops-adplatform-interfacelog-191: fetched offset =
> 18735600059, log end offset = 18735600055."
>
> The relevant logs are attached.
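
The interleaving in steps 3 to 6 of the description can be replayed with a small single-threaded Java sketch that performs the operations in the reported order. This is a toy model, not Kafka code: the class `StaleFetchOffsetSketch` and its methods are hypothetical stand-ins named after the functions mentioned in the report, and it assumes that the second `makeFollower` reads the log end offset outside `partitionMapLock`, between removing and re-adding the fetcher state.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantLock;

/** Toy model of the reported race. Hypothetical names; not the Kafka implementation. */
final class StaleFetchOffsetSketch {
    static final String TP = "aiops-adplatform-interfacelog-191";

    private final ReentrantLock partitionMapLock = new ReentrantLock();
    // partition -> fetch offset, standing in for `partitionStates`
    private final Map<String, Long> partitionStates = new HashMap<>();
    private long logEndOffset;

    StaleFetchOffsetSketch(long logEndOffset) {
        this.logEndOffset = logEndOffset;
    }

    // Second makeFollower, part 1: drop the fetch state under the lock.
    void removeFetcher() {
        partitionMapLock.lock();
        try {
            partitionStates.remove(TP);
        } finally {
            partitionMapLock.unlock();
        }
    }

    // Second makeFollower reads the log end offset without holding the lock (assumption).
    long readLogEndOffset() {
        return logEndOffset;
    }

    // Fetcher thread: truncate the local log under the lock.
    void truncateTo(long offset) {
        partitionMapLock.lock();
        try {
            logEndOffset = Math.min(logEndOffset, offset);
        } finally {
            partitionMapLock.unlock();
        }
    }

    // Second makeFollower, part 2: install the previously captured fetch offset.
    void addFetcher(long fetchOffset) {
        partitionMapLock.lock();
        try {
            partitionStates.put(TP, fetchOffset);
        } finally {
            partitionMapLock.unlock();
        }
    }

    // Consistency check modelled on the exception message quoted in the report.
    void processPartitionData() {
        long fetchOffset = partitionStates.get(TP);
        if (fetchOffset != logEndOffset) {
            throw new IllegalStateException("Offset mismatch for partition " + TP
                + ": fetched offset = " + fetchOffset
                + ", log end offset = " + logEndOffset + ".");
        }
    }

    // Replays steps 3 to 6 in the reported order and returns the outcome.
    static String simulate() {
        StaleFetchOffsetSketch follower = new StaleFetchOffsetSketch(18735600059L);
        follower.addFetcher(18735600059L);                     // state left by the first makeFollower
        follower.removeFetcher();                              // step 3: second makeFollower takes the lock first
        long capturedOffset = follower.readLogEndOffset();     // step 3: captures 18735600059
        follower.truncateTo(18735600055L);                     // step 4: fetcher thread truncates
        follower.addFetcher(capturedOffset);                   // step 5: stale offset is installed
        try {
            follower.processPartitionData();                   // step 6
            return "no mismatch";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(simulate());
    }
}
```

Running `main` prints the same message as the reported exception. The sketch only shows that an offset captured before the truncation and written after it leaves the fetch state ahead of the log; it says nothing about how the real threads are scheduled.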