Francois Visconte created KAFKA-16890:
-----------------------------------------
Summary: Failing to build aux state on broker failover
Key: KAFKA-16890
URL: https://issues.apache.org/jira/browse/KAFKA-16890
Project: Kafka
Issue Type: Bug
Components: Tiered-Storage
Affects Versions: 3.7.0, 3.7.1
Reporter: Francois Visconte
We have clusters where we replace machines often, and they fall into a state
where we repeatedly see "Error building remote log auxiliary state for
loadtest_topic-22" and the partition stays under-replicated until the leader is
manually restarted.
Looking into a specific case, here is what we observed in the
__remote_log_metadata topic:
{code:java}
partition: 29, offset: 183593, value:
RemoteLogSegmentMetadata{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
id=GZeRTXLMSNe2BQjRXkg6hQ}, startOffset=10823, endOffset=11536,
brokerId=10013, maxTimestampMs=1715774588597, eventTimestampMs=1715781657604,
segmentLeaderEpochs={125=10823, 126=10968, 128=11047, 130=11048, 131=11324,
133=11442, 134=11443, 135=11445, 136=11521, 137=11533, 139=11535},
segmentSizeInBytes=704895, customMetadata=Optional.empty,
state=COPY_SEGMENT_STARTED}
partition: 29, offset: 183594, value:
RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
id=GZeRTXLMSNe2BQjRXkg6hQ}, customMetadata=Optional.empty,
state=COPY_SEGMENT_FINISHED, eventTimestampMs=1715781658183, brokerId=10013}
partition: 29, offset: 183669, value:
RemoteLogSegmentMetadata{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
id=L1TYzx0lQkagRIF86Kp0QQ}, startOffset=10823, endOffset=11544,
brokerId=10008, maxTimestampMs=1715781445270, eventTimestampMs=1715782717593,
segmentLeaderEpochs={125=10823, 126=10968, 128=11047, 130=11048, 131=11324,
133=11442, 134=11443, 135=11445, 136=11521, 137=11533, 139=11535, 140=11537,
142=11543}, segmentSizeInBytes=713088, customMetadata=Optional.empty,
state=COPY_SEGMENT_STARTED}
partition: 29, offset: 183670, value:
RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
id=L1TYzx0lQkagRIF86Kp0QQ}, customMetadata=Optional.empty,
state=COPY_SEGMENT_FINISHED, eventTimestampMs=1715782718370, brokerId=10008}
partition: 29, offset: 186215, value:
RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
id=L1TYzx0lQkagRIF86Kp0QQ}, customMetadata=Optional.empty,
state=DELETE_SEGMENT_STARTED, eventTimestampMs=1715867874617, brokerId=10008}
partition: 29, offset: 186216, value:
RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
id=L1TYzx0lQkagRIF86Kp0QQ}, customMetadata=Optional.empty,
state=DELETE_SEGMENT_FINISHED, eventTimestampMs=1715867874725, brokerId=10008}
partition: 29, offset: 186217, value:
RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
id=GZeRTXLMSNe2BQjRXkg6hQ}, customMetadata=Optional.empty,
state=DELETE_SEGMENT_STARTED, eventTimestampMs=1715867874729, brokerId=10008}
partition: 29, offset: 186218, value:
RemoteLogSegmentMetadataUpdate{remoteLogSegmentId=RemoteLogSegmentId{topicIdPartition=ClnIeN0MQsi_d4FAOFKaDA:loadtest_topic-22,
id=GZeRTXLMSNe2BQjRXkg6hQ}, customMetadata=Optional.empty,
state=DELETE_SEGMENT_FINISHED, eventTimestampMs=1715867874817, brokerId=10008}
{code}
It seems that around the time the original leader (10013) was restarted, a
second copy of the same segment was tiered by the new leader (10008).
Interestingly, the two copies do not have the same end offset, which is
concerning.
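To make the observation above concrete, here is a minimal sketch (hypothetical helper classes, not Kafka code) that compares the offset ranges of the two RemoteLogSegmentMetadata records from the dump: both start at 10823, so their ranges overlap, yet their end offsets differ (11536 vs 11544).

{code:java}
// Hypothetical sketch: models the two segment metadata records above as
// simple (id, startOffset, endOffset) tuples and checks whether their
// offset ranges overlap. Names are illustrative, not Kafka APIs.
record Segment(String id, long startOffset, long endOffset) {}

public class SegmentOverlapCheck {

    // Two segments overlap if neither range ends before the other begins.
    static boolean overlaps(Segment a, Segment b) {
        return a.startOffset() <= b.endOffset() && b.startOffset() <= a.endOffset();
    }

    public static void main(String[] args) {
        // Values taken from the __remote_log_metadata dump above.
        Segment first  = new Segment("GZeRTXLMSNe2BQjRXkg6hQ", 10823, 11536); // tiered by 10013
        Segment second = new Segment("L1TYzx0lQkagRIF86Kp0QQ", 10823, 11544); // tiered by 10008

        System.out.println("ranges overlap: " + overlaps(first, second));              // true
        System.out.println("same end offset: " + (first.endOffset() == second.endOffset())); // false
    }
}
{code}

So the second upload is not a byte-identical re-upload of the first: it covers the same starting range but extends further, with two extra leader epochs (140 and 142) in its segmentLeaderEpochs map.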
Then the follower sees the following error repeatedly until the leader is
restarted:
{code:java}
[2024-05-17 20:46:42,133] DEBUG [ReplicaFetcher replicaId=10013,
leaderId=10008, fetcherId=0] Handling errors in processFetchRequest for
partitions HashSet(loadtest_topic-22) (kafka.server.ReplicaFetcherThread)
[2024-05-17 20:46:43,174] DEBUG [ReplicaFetcher replicaId=10013,
leaderId=10008, fetcherId=0] Received error OFFSET_MOVED_TO_TIERED_STORAGE, at
fetch offset: 11537, topic-partition: loadtest_topic-22
(kafka.server.ReplicaFetcherThread)
[2024-05-17 20:46:43,175] ERROR [ReplicaFetcher replicaId=10013,
leaderId=10008, fetcherId=0] Error building remote log auxiliary state for
loadtest_topic-22 (kafka.server.ReplicaFetcherThread)
org.apache.kafka.server.log.remote.storage.RemoteStorageException: Couldn't
build the state from remote store for partition: loadtest_topic-22,
currentLeaderEpoch: 153, leaderLocalLogStartOffset: 11545,
leaderLogStartOffset: 11537, epoch: 142as the previous remote log segment
metadata was not found
{code}
The follower tries to fetch from offset 11537 and gets
OFFSET_MOVED_TO_TIERED_STORAGE. When it retries, it still thinks it needs to
fetch from 11537. There is no data in S3, so the correct leaderLogStartOffset
should be 11545. I'm not sure yet whether it's intentional that two differing
copies of the same segment can be uploaded to S3, or whether the segments were
simply deleted in the wrong order, but that is what ultimately caused the
leaderLogStartOffset to be set incorrectly.
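The retry loop described above can be sketched as follows (a hypothetical simulation under the assumptions stated in this report, not Kafka code): the leader's log start offset (11537) still points into remote data that has been deleted, so every fetch below the leader's local log start offset (11545) redirects the follower to tiered storage, where no covering segment exists.

{code:java}
// Hypothetical sketch of the stuck-follower loop. All offsets come from the
// log excerpt above; the class and method names are illustrative only.
import java.util.Map;
import java.util.Optional;

public class StuckFollowerSketch {
    static final long LEADER_LOG_START_OFFSET = 11537L;       // stale: points into deleted remote data
    static final long LEADER_LOCAL_LOG_START_OFFSET = 11545L; // first offset on the leader's local disk

    // Remote store state after offsets 186215-186218: both copies were
    // DELETE_SEGMENT_FINISHED, so no remote segments remain.
    static final Map<String, long[]> remoteSegments = Map.of();

    static Optional<String> findRemoteSegmentCovering(long offset) {
        return remoteSegments.entrySet().stream()
                .filter(e -> e.getValue()[0] <= offset && offset <= e.getValue()[1])
                .map(Map.Entry::getKey)
                .findFirst();
    }

    public static void main(String[] args) {
        long fetchOffset = LEADER_LOG_START_OFFSET; // the follower retries from 11537 every time

        // Fetching below the leader's local log start redirects to tiered storage.
        boolean moved = fetchOffset < LEADER_LOCAL_LOG_START_OFFSET;
        System.out.println("OFFSET_MOVED_TO_TIERED_STORAGE: " + moved); // true

        // Building the aux state needs the remote segment containing fetchOffset,
        // but it was deleted, so the follower fails and retries from 11537 again.
        System.out.println("aux state buildable: "
                + findRemoteSegmentCovering(fetchOffset).isPresent()); // false

        // With no remote data left, the leader's log start offset should have
        // advanced to its local log start, breaking the loop:
        System.out.println("expected leaderLogStartOffset: " + LEADER_LOCAL_LOG_START_OFFSET);
    }
}
{code}

Under these assumptions the follower can never make progress: the only exit is for the leader to advance leaderLogStartOffset to 11545, which in practice only happens when the leader is restarted.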
--
This message was sent by Atlassian Jira
(v8.20.10#820010)