[
https://issues.apache.org/jira/browse/KAFKA-7415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Gustafson updated KAFKA-7415:
-----------------------------------
Fix Version/s: 1.1.2
> OffsetsForLeaderEpoch may incorrectly respond with undefined epoch causing
> truncation to HW
> -------------------------------------------------------------------------------------------
>
> Key: KAFKA-7415
> URL: https://issues.apache.org/jira/browse/KAFKA-7415
> Project: Kafka
> Issue Type: Bug
> Components: replication
> Affects Versions: 2.0.0
> Reporter: Anna Povzner
> Assignee: Jason Gustafson
> Priority: Major
> Fix For: 1.1.2, 2.0.1, 2.1.0
>
>
> If the follower's last appended epoch is ahead of the leader's last appended
> epoch, the OffsetsForLeaderEpoch response incorrectly contains
> (UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET), and the follower truncates to its
> high watermark (HW). This may lead to data loss in rare cases where two
> back-to-back leader elections happen (failure of one leader, followed by quick
> re-election of the next leader due to preferred leader election, so that all
> replicas are still in the ISR) and the third leader then fails.
> The bug is in LeaderEpochFileCache.endOffsetFor(), which returns
> (UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET) if the requested leader epoch is
> ahead of the last leader epoch in the cache. The method should return (last
> leader epoch in the cache, LEO) in this scenario.
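The difference between the current and the proposed lookup can be illustrated with a minimal sketch. This is a simplified Java model, not Kafka's actual LeaderEpochFileCache: the class name `EpochCacheSketch`, the two method names, and the `long[]` return shape are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a leader epoch cache (not Kafka's real class).
class EpochCacheSketch {
    static final int UNDEFINED_EPOCH = -1;
    static final long UNDEFINED_EPOCH_OFFSET = -1L;

    // Each entry is {epoch, startOffset}; entries are kept in ascending epoch order.
    private final List<long[]> entries = new ArrayList<>();
    private final long logEndOffset;

    EpochCacheSketch(long logEndOffset) {
        this.logEndOffset = logEndOffset;
    }

    void assign(int epoch, long startOffset) {
        entries.add(new long[] {epoch, startOffset});
    }

    private int latestEpoch() {
        return (int) entries.get(entries.size() - 1)[0];
    }

    private static long[] undefined() {
        return new long[] {UNDEFINED_EPOCH, UNDEFINED_EPOCH_OFFSET};
    }

    // Behavior described as the bug: a requested epoch ahead of the cache is "undefined".
    long[] endOffsetForBuggy(int requestedEpoch) {
        if (entries.isEmpty() || requestedEpoch > latestEpoch()) {
            return undefined();
        }
        return lookup(requestedEpoch);
    }

    // Behavior proposed in the issue: answer with (last epoch in the cache, LEO).
    long[] endOffsetForProposed(int requestedEpoch) {
        if (entries.isEmpty() || requestedEpoch == UNDEFINED_EPOCH) {
            return undefined();
        }
        if (requestedEpoch > latestEpoch()) {
            return new long[] {latestEpoch(), logEndOffset};
        }
        return lookup(requestedEpoch);
    }

    // Finds the largest cached epoch <= requestedEpoch. Its end offset is the start
    // offset of the next cached epoch, or the LEO if it is the latest epoch.
    private long[] lookup(int requestedEpoch) {
        long[] result = undefined();
        for (int i = 0; i < entries.size(); i++) {
            long[] entry = entries.get(i);
            if (entry[0] > requestedEpoch) {
                break;
            }
            long end = (i + 1 < entries.size()) ? entries.get(i + 1)[1] : logEndOffset;
            result = new long[] {entry[0], end};
        }
        return result;
    }
}
```

With a single entry (epoch 0, start offset 0) and an LEO of 10, a request for epoch 1 yields (-1, -1) from `endOffsetForBuggy` and (0, 10) from `endOffsetForProposed`.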
> We don't create an entry in the leader epoch cache until a message is appended
> with the new leader epoch. Every append to the log calls
> LeaderEpochFileCache.assign(). However, it would be much cleaner if
> `makeLeader` created an entry in the cache as soon as the replica becomes
> leader, which would fix the bug. In case the leader never appends any
> messages and the next leader epoch starts at the same offset, we already
> have clearAndFlushLatest(), which clears entries with start offsets greater
> than or equal to the passed offset. LeaderEpochFileCache.assign() could be
> merged with clearAndFlushLatest() so that assigning a new epoch also clears
> cache entries with start offsets greater than or equal to the new epoch's
> start offset, and the two methods no longer need to be called separately.
>
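A merged assign-and-clear step, as suggested above, might look roughly like the following. This is a sketch under assumptions: `EpochAssignSketch` is a hypothetical class, persistence (flushing to the checkpoint file) is omitted, and treating an epoch that is not newer than the latest cached epoch as a no-op is an assumption made here to model repeated appends within one epoch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of assign() merged with the clearing step; not Kafka's code.
class EpochAssignSketch {
    // Each entry is {epoch, startOffset}, in ascending epoch order.
    final List<long[]> entries = new ArrayList<>();

    void assign(int epoch, long startOffset) {
        if (!entries.isEmpty() && epoch <= entries.get(entries.size() - 1)[0]) {
            return; // assumption: the epoch is already recorded, nothing to do
        }
        // Drop entries that start at or after the new epoch's start offset,
        // e.g. an epoch whose leader never appended any messages.
        entries.removeIf(e -> e[1] >= startOffset);
        entries.add(new long[] {epoch, startOffset});
    }
}
```

For example, after `assign(0, 0)`, `assign(1, 10)`, and `assign(2, 10)`, the cache holds only (0, 0) and (2, 10): the entry for epoch 1, whose leader never appended anything, is replaced by the entry for epoch 2.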
> Here is an example of a scenario where the issue leads to data loss.
> Suppose we have three replicas: r1, r2, and r3. Initially, the ISR consists
> of (r1, r2, r3) and the leader is r1. The data up to offset 10 has been
> committed to the ISR. Here is the initial state:
> {code:java}
> Leader: r1
> leader epoch: 0
> ISR(r1, r2, r3)
> r1: [hw=10, leo=10]
> r2: [hw=8, leo=10]
> r3: [hw=5, leo=10]
> {code}
> Replica 1 fails and leaves the ISR, which makes Replica 2 the new leader with
> leader epoch = 1. The leader appends a batch, but the batch is not yet
> replicated to the followers.
> {code:java}
> Leader: r2
> leader epoch: 1
> ISR(r2, r3)
> r1: [hw=10, leo=10]
> r2: [hw=8, leo=11]
> r3: [hw=5, leo=10]
> {code}
> Replica 3 is elected leader with leader epoch 2 (due to preferred leader
> election) before it has a chance to truncate.
> {code:java}
> Leader: r3
> leader epoch: 2
> ISR(r2, r3)
> r1: [hw=10, leo=10]
> r2: [hw=8, leo=11]
> r3: [hw=5, leo=10]
> {code}
> Replica 2 sends OffsetsForLeaderEpoch(leader epoch = 1) to Replica 3. Replica
> 3 incorrectly replies with UNDEFINED_EPOCH_OFFSET, and Replica 2 truncates to
> its HW (8), discarding offsets 8 and 9, which were already committed. If
> Replica 3 fails before Replica 2 re-fetches the data, this may lead to data
> loss.
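The follower side of the last step can be sketched as follows, using Replica 2's state from the example (hw=8, leo=11). The class and method names are hypothetical, and the rule "fall back to HW when the end offset is undefined, otherwise truncate to the smaller of the leader's end offset and the local LEO" is a simplified reading of the behavior described in this issue, not Kafka's actual fetcher code. The reply (0, 10) is what Replica 3 would be expected to return if its cache held an entry for epoch 2 starting at offset 10.

```java
// Hypothetical sketch of how a follower picks its truncation offset.
class FollowerTruncationSketch {
    static final long UNDEFINED_EPOCH_OFFSET = -1L;

    // Returns the offset the follower truncates its log to, given the end
    // offset from the leader's OffsetsForLeaderEpoch reply.
    static long truncationOffset(long leaderEndOffset, long highWatermark, long logEndOffset) {
        if (leaderEndOffset == UNDEFINED_EPOCH_OFFSET) {
            return highWatermark; // no usable answer: fall back to the high watermark
        }
        return Math.min(leaderEndOffset, logEndOffset);
    }

    public static void main(String[] args) {
        long hw = 8L;   // Replica 2's high watermark
        long leo = 11L; // Replica 2's log end offset

        // Undefined reply: truncates to 8, dropping committed offsets 8 and 9.
        System.out.println(truncationOffset(UNDEFINED_EPOCH_OFFSET, hw, leo));

        // Reply with end offset 10: only the uncommitted offset 10 is dropped.
        System.out.println(truncationOffset(10L, hw, leo));
    }
}
```

Running `main` prints 8 and then 10, which shows why the undefined reply is the harmful case: it discards data below offset 10 that the example treats as committed.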