[
https://issues.apache.org/jira/browse/KAFKA-19233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Travis Bischel updated KAFKA-19233:
-----------------------------------
Description:
If a group is on generation > 1 and a member is fenced, the member cannot
rejoin until the broker expires the member from the group.
KIP-848 says "Upon receiving the UNKNOWN_MEMBER_ID or FENCED_MEMBER_EPOCH
error, the consumer abandon all its partitions and rejoins with the same member
id and the epoch 0.".
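For reference, the client-side behavior that sentence describes can be sketched roughly as follows. This is only an illustration, not code from any real client: the class `RejoinSketch`, its fields, and `onHeartbeatError` are hypothetical names. The numeric codes are the Kafka protocol values for the two errors (110 matches the log line further down).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical client-side member state; names are illustrative only.
final class RejoinSketch {
    // Kafka protocol error codes for the two errors named in KIP-848.
    static final int UNKNOWN_MEMBER_ID = 25;
    static final int FENCED_MEMBER_EPOCH = 110;

    final String memberId;
    int memberEpoch;
    final List<String> ownedPartitions = new ArrayList<>();

    RejoinSketch(String memberId, int memberEpoch) {
        this.memberId = memberId;
        this.memberEpoch = memberEpoch;
    }

    // On either error: abandon all partitions, keep the member id, and reset
    // the epoch to 0 so the next heartbeat is a rejoin. Returns true if the
    // member state was reset.
    boolean onHeartbeatError(int errorCode) {
        if (errorCode == UNKNOWN_MEMBER_ID || errorCode == FENCED_MEMBER_EPOCH) {
            ownedPartitions.clear();
            memberEpoch = 0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        RejoinSketch member = new RejoinSketch("uxNPFKnjF3OrkZIAghLN1Q==", 13);
        member.ownedPartitions.add("topic-0");
        member.onHeartbeatError(FENCED_MEMBER_EPOCH);
        System.out.println(member.memberId + " rejoins with epoch " + member.memberEpoch);
    }
}
```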
However, the current broker implementation throws FENCED_MEMBER_EPOCH whenever
the client-provided epoch differs from the current epoch and is anything other
than the current epoch - 1.
Specifically this line:
[https://github.com/apache/kafka/blob/e68781414e9bcbc1d7cd5c247433a13f8d0e2e6e/group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupMetadataManager.java#L1535]
If the current epoch is 13, and I reset to epoch 0, the conditional always
throws FENCED_MEMBER_EPOCH.
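To make the condition concrete, here is a minimal sketch of the check as I have described it, next to one possible variant that accepts epoch 0 as a rejoin. This is not the code from GroupMetadataManager: the class and method names are hypothetical, and the real check may take other inputs into account that are not modeled here.

```java
// Hypothetical sketch of the epoch check described in this report. It is NOT
// the code from GroupMetadataManager; all names here are illustrative.
final class EpochCheckSketch {

    // Behavior as described above: a mismatched epoch is accepted only when
    // it equals the current epoch - 1, so a reset to 0 is fenced whenever the
    // current member epoch is greater than 1.
    static boolean fencedAsReported(int receivedEpoch, int currentEpoch) {
        if (receivedEpoch == currentEpoch) {
            return false;
        }
        return receivedEpoch != currentEpoch - 1;
    }

    // Assumed alternative following the KIP-848 sentence quoted earlier:
    // epoch 0 is treated as a rejoin and is never fenced.
    static boolean fencedAllowingRejoin(int receivedEpoch, int currentEpoch) {
        if (receivedEpoch == 0) {
            return false;
        }
        return fencedAsReported(receivedEpoch, currentEpoch);
    }

    public static void main(String[] args) {
        System.out.println(fencedAsReported(0, 13));     // true: rejoin is fenced
        System.out.println(fencedAsReported(12, 13));    // false: previous epoch
        System.out.println(fencedAllowingRejoin(0, 13)); // false: rejoin accepted
    }
}
```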
Attached are logs of this case. Here is a single log line demonstrating the
problem:
{code:java}
2025-05-02 15:23:09,304
[data-plane-kafka-network-thread-3-ListenerName(PLAINTEXT)-PLAINTEXT-0] DEBUG
kafka.request.logger - Completed
request:{"isForwarded":false,"requestHeader":{"requestApiKey":68,"requestApiVersion":1,"correlationId":46,"clientId":"kgo","requestApiKeyName":"CONSUMER_GROUP_HEARTBEAT"},"request":{"groupId":"67660d2bfc7b197c91ff86623614522285c05c14b9f817fa99e6c105a2f54d7f","memberId":"uxNPFKnjF3OrkZIAghLN1Q==","memberEpoch":0,"instanceId":null,"rackId":null,"rebalanceTimeoutMs":60000,"subscribedTopicNames":["aed98f76851080d77b6098a03ea5ef088dabc21331462e44ed7ae5be463e2655"],"subscribedTopicRegex":null,"serverAssignor":"range","topicPartitions":[]},"response":{"throttleTimeMs":0,"errorCode":110,"errorMessage":"The
consumer group member has a smaller member epoch (0) than the one known by the
group coordinator (11). The member must abandon all its partitions and
rejoin.","memberId":null,"memberEpoch":0,"heartbeatIntervalMs":0,"assignment":null},"connection":"127.0.0.1:9096-127.0.0.1:56686-0-292","totalTimeMs":0.801,"requestQueueTimeMs":0.159,"localTimeMs":0.106,"remoteTimeMs":0.315,"throttleTimeMs":0,"responseQueueTimeMs":0.066,"sendTimeMs":0.153,"securityProtocol":"PLAINTEXT","principal":"User:ANONYMOUS","listener":"PLAINTEXT","clientInformation":{"softwareName":"kgo","softwareVersion":"unknown"}}
{code}
The logs show the broker continuously responding with error code 110
(FENCED_MEMBER_EPOCH) for 50s until, I'm assuming, some condition removes the
member from the group, so that the next request the broker receives is treated
as coming from a new member and succeeds.
The first heartbeat is duplicated; I noticed that Kafka replies with
FENCED_MEMBER_EPOCH _way too often_ if a heartbeat is duplicated, and I'm
trying to see if it's possible to work around that. As an aside, between the
fenced error happening {_}a lot{_}, this issue, and KAFKA-19222, I'm leaning
toward not opting into KIP-848 by default until the broker implementation
improves.
was:
If a group is on generation > 1 and a member is fenced, the member cannot
rejoin until the broker expires the member from the group.
KIP-848 says "Upon receiving the UNKNOWN_MEMBER_ID or FENCED_MEMBER_EPOCH
error, the consumer abandon all its partitions and rejoins with the same member
id and the epoch 0.".
However, the current implementation on the broker throws FENCED_LEADER_EPOCH if
the client provided epoch, when not equal to the current epoch, is anything
other than the current epoch - 1.
Specifically this line:
https://github.com/apache/kafka/blob/e68781414e9bcbc1d7cd5c247433a13f8d0e2e6e/group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupMetadataManager.java#L1535
If the current epoch is 13, and I reset to epoch 0, the conditional always
throws FENCED_LEADER_EPOCH.
Attached are logs of this case, here is a sample of a single log line
demonstrating the problem:
{code}
2025-05-02 15:23:09,304
[data-plane-kafka-network-thread-3-ListenerName(PLAINTEXT)-PLAINTEXT-0] DEBUG
kafka.request.logger - Completed
request:{"isForwarded":false,"requestHeader":{"requestApiKey":68,"requestApiVersion":1,"correlationId":46,"clientId":"kgo","requestApiKeyName":"CONSUMER_GROUP_HEARTBEAT"},"request":{"groupId":"67660d2bfc7b197c91ff86623614522285c05c14b9f817fa99e6c105a2f54d7f","memberId":"uxNPFKnjF3OrkZIAghLN1Q==","memberEpoch":0,"instanceId":null,"rackId":null,"rebalanceTimeoutMs":60000,"subscribedTopicNames":["aed98f76851080d77b6098a03ea5ef088dabc21331462e44ed7ae5be463e2655"],"subscribedTopicRegex":null,"serverAssignor":"range","topicPartitions":[]},"response":{"throttleTimeMs":0,"errorCode":110,"errorMessage":"The
consumer group member has a smaller member epoch (0) than the one known by the
group coordinator (11). The member must abandon all its partitions and
rejoin.","memberId":null,"memberEpoch":0,"heartbeatIntervalMs":0,"assignment":null},"connection":"127.0.0.1:9096-127.0.0.1:56686-0-292","totalTimeMs":0.801,"requestQueueTimeMs":0.159,"localTimeMs":0.106,"remoteTimeMs":0.315,"throttleTimeMs":0,"responseQueueTimeMs":0.066,"sendTimeMs":0.153,"securityProtocol":"PLAINTEXT","principal":"User:ANONYMOUS","listener":"PLAINTEXT","clientInformation":{"softwareName":"kgo","softwareVersion":"unknown"}}
{code}
The logs show the broker continuously responding errcode 110 for 50s until, I'm
assuming, some condition boots the member from the group, such that the next
time the broker receives the request, the member is considered new and the
request is successful.
The first heartbeat is duplicated; I noticed that Kafka replies with
FENCED_LEADER_EPOCH _way too often_ if a heartbeat is duplicated, and I'm
trying to see if it's possible to work around that. As an aside, between the
fenced error happening _a lot_, this issue, and KAFKA-19222, I'm leaning to not
opt into KIP-848 by default until the broker implementation improves.
> Members cannot rejoin with epoch=0 for KIP-848
> ----------------------------------------------
>
> Key: KAFKA-19233
> URL: https://issues.apache.org/jira/browse/KAFKA-19233
> Project: Kafka
> Issue Type: Bug
> Components: clients, consumer
> Reporter: Travis Bischel
> Priority: Major
> Attachments: logs1
>