[
https://issues.apache.org/jira/browse/KAFKA-16890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Satish Duggana resolved KAFKA-16890.
------------------------------------
Resolution: Fixed
> Failing to build aux state on broker failover
> ---------------------------------------------
>
> Key: KAFKA-16890
> URL: https://issues.apache.org/jira/browse/KAFKA-16890
> Project: Kafka
> Issue Type: Bug
> Components: Tiered-Storage
> Affects Versions: 3.7.0, 3.7.1
> Reporter: Francois Visconte
> Assignee: Kamal Chandraprakash
> Priority: Major
> Fix For: 3.8.0
>
>
> We have clusters where we replace machines often, and they fall into a state
> where we keep seeing "Error building remote log auxiliary state for
> loadtest_topic-22" and the partition stays under-replicated until the leader
> is manually restarted.
> Looking into a specific case, here is what we observed in
> __remote_log_metadata topic:
> {code:java}
>
> partition: 29, offset: 183593, value:
> RemoteLogSegmentMetadata{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
> id=GZeRTXLMSNe2BQjRXkg6hQ}, startOffset=10823, endOffset=11536,
> brokerId=10013, maxTimestampMs=1715774588597, eventTimestampMs=1715781657604,
> segmentLeaderEpochs={125=10823, 126=10968, 128=11047, 130=11048, 131=11324,
> 133=11442, 134=11443, 135=11445, 136=11521, 137=11533, 139=11535},
> segmentSizeInBytes=704895, customMetadata=Optional.empty,
> state=COPY_SEGMENT_STARTED}
> partition: 29, offset: 183594, value:
> RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
> id=GZeRTXLMSNe2BQjRXkg6hQ}, customMetadata=Optional.empty,
> state=COPY_SEGMENT_FINISHED, eventTimestampMs=1715781658183, brokerId=10013}
> partition: 29, offset: 183669, value:
> RemoteLogSegmentMetadata{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
> id=L1TYzx0lQkagRIF86Kp0QQ}, startOffset=10823, endOffset=11544,
> brokerId=10008, maxTimestampMs=1715781445270, eventTimestampMs=1715782717593,
> segmentLeaderEpochs={125=10823, 126=10968, 128=11047, 130=11048, 131=11324,
> 133=11442, 134=11443, 135=11445, 136=11521, 137=11533, 139=11535, 140=11537,
> 142=11543}, segmentSizeInBytes=713088, customMetadata=Optional.empty,
> state=COPY_SEGMENT_STARTED}
> partition: 29, offset: 183670, value:
> RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
> id=L1TYzx0lQkagRIF86Kp0QQ}, customMetadata=Optional.empty,
> state=COPY_SEGMENT_FINISHED, eventTimestampMs=1715782718370, brokerId=10008}
> partition: 29, offset: 186215, value:
> RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
> id=L1TYzx0lQkagRIF86Kp0QQ}, customMetadata=Optional.empty,
> state=DELETE_SEGMENT_STARTED, eventTimestampMs=1715867874617, brokerId=10008}
> partition: 29, offset: 186216, value:
> RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
> id=L1TYzx0lQkagRIF86Kp0QQ}, customMetadata=Optional.empty,
> state=DELETE_SEGMENT_FINISHED, eventTimestampMs=1715867874725, brokerId=10008}
> partition: 29, offset: 186217, value:
> RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
> id=GZeRTXLMSNe2BQjRXkg6hQ}, customMetadata=Optional.empty,
> state=DELETE_SEGMENT_STARTED, eventTimestampMs=1715867874729, brokerId=10008}
> partition: 29, offset: 186218, value:
> RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
> id=GZeRTXLMSNe2BQjRXkg6hQ}, customMetadata=Optional.empty,
> state=DELETE_SEGMENT_FINISHED, eventTimestampMs=1715867874817, brokerId=10008}
> {code}
>
> It seems that around the time the old leader (10013) is restarted, a second
> copy of the same segment is tiered by the new leader (10008). Interestingly,
> the two copies don't have the same end offset, which is concerning.
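> The duplicate upload can be spotted mechanically. As a hypothetical sketch
> (the helper below is illustrative and not part of Kafka; only the segment IDs
> and offsets are taken from the metadata records above), two records whose
> offset ranges overlap but whose segment IDs differ flag the condition:

```python
# Illustrative helper (not a Kafka API): flag pairs of remote log segments
# whose offset ranges overlap even though they have different segment IDs.
def find_overlapping_copies(segments):
    """segments: list of dicts with 'id', 'startOffset', 'endOffset'."""
    overlaps = []
    for i, a in enumerate(segments):
        for b in segments[i + 1:]:
            ranges_overlap = (a['startOffset'] <= b['endOffset']
                              and b['startOffset'] <= a['endOffset'])
            if a['id'] != b['id'] and ranges_overlap:
                overlaps.append((a['id'], b['id']))
    return overlaps

# IDs and offsets copied from the __remote_log_metadata records above:
events = [
    {'id': 'GZeRTXLMSNe2BQjRXkg6hQ', 'startOffset': 10823, 'endOffset': 11536},
    {'id': 'L1TYzx0lQkagRIF86Kp0QQ', 'startOffset': 10823, 'endOffset': 11544},
]
# Both copies cover 10823..11536, so the pair is flagged.
print(find_overlapping_copies(events))
```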
> Then the follower sees the following error repeatedly until the leader is
> restarted:
>
> {code:java}
> [2024-05-17 20:46:42,133] DEBUG [ReplicaFetcher replicaId=10013,
> leaderId=10008, fetcherId=0] Handling errors in processFetchRequest for
> partitions HashSet(loadtest_topic-22) (kafka.server.ReplicaFetcherThread)
> [2024-05-17 20:46:43,174] DEBUG [ReplicaFetcher replicaId=10013,
> leaderId=10008, fetcherId=0] Received error OFFSET_MOVED_TO_TIERED_STORAGE,
> at fetch offset: 11537, topic-partition: loadtest_topic-22
> (kafka.server.ReplicaFetcherThread)
> [2024-05-17 20:46:43,175] ERROR [ReplicaFetcher replicaId=10013,
> leaderId=10008, fetcherId=0] Error building remote log auxiliary state for
> loadtest_topic-22 (kafka.server.ReplicaFetcherThread)
> org.apache.kafka.server.log.remote.storage.RemoteStorageException: Couldn't
> build the state from remote store for partition: loadtest_topic-22,
> currentLeaderEpoch: 153, leaderLocalLogStartOffset: 11545,
> leaderLogStartOffset: 11537, epoch: 142 as the previous remote log segment
> metadata was not found
> {code}
> The follower is trying to fetch from 11537 and gets
> OFFSET_MOVED_TO_TIERED_STORAGE. Then, when the follower retries, it still
> thinks it needs to fetch from 11537. There is no data in S3, so the correct
> leaderLogStartOffset should be 11545. I'm not sure yet whether it's
> intentional that two different copies of the same segment can be uploaded to
> S3, or whether the segments were just deleted in the wrong order, but that is
> what ultimately caused the leaderLogStartOffset to be set incorrectly.
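> To make the failure mode concrete, here is a hypothetical model of the
> follower's retry loop (the function and field names are illustrative, not
> Kafka's API; only the offsets come from the log above): with every remote
> segment already deleted, no metadata covers 11537..11544, so building the
> auxiliary state can never succeed and the fetcher fails on every retry:

```python
# Illustrative model (not Kafka code) of the aux-state build described above.
def build_aux_state(fetch_offset, remote_segments):
    """Return the offset the follower can resume local fetching from, or
    raise if no remote segment metadata covers fetch_offset."""
    for seg in remote_segments:
        if seg['startOffset'] <= fetch_offset <= seg['endOffset']:
            return seg['endOffset'] + 1
    raise RuntimeError(
        f"previous remote log segment metadata was not found "
        f"for offset {fetch_offset}")

# All remote copies were deleted (offsets 186215-186218 above), yet the
# leader still advertises leaderLogStartOffset=11537, so every retry at
# offset 11537 fails:
try:
    build_aux_state(11537, remote_segments=[])
except RuntimeError as e:
    print(e)
```

> If the stale leaderLogStartOffset of 11537 had instead advanced to the
> local log start offset of 11545 once the remote segments were gone, the
> follower would not have tried to build auxiliary state for the deleted
> tiered range at all.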
--
This message was sent by Atlassian Jira
(v8.20.10#820010)