[jira] [Updated] (KAFKA-19233) Members cannot rejoin with epoch=0 for KIP-848

Travis Bischel (Jira) Fri, 02 May 2025 13:50:05 -0700


     [ 
https://issues.apache.org/jira/browse/KAFKA-19233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Travis Bischel updated KAFKA-19233:
-----------------------------------
    Description: 
If a group is on generation > 1 and a member is fenced, the member cannot 
rejoin until the broker expires the member from the group.

KIP-848 says "Upon receiving the UNKNOWN_MEMBER_ID or FENCED_MEMBER_EPOCH 
error, the consumer abandon all its partitions and rejoins with the same member 
id and the epoch 0.".
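
To make that client-side expectation concrete, here is a minimal sketch of how I read the quoted sentence. The class, field, and method names are hypothetical and not taken from any real client; the error code values are the Kafka protocol codes for UNKNOWN_MEMBER_ID (25) and FENCED_MEMBER_EPOCH (110).

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical client-side member state; illustrative only, not a real client API.
final class RejoinSketch {
    static final short UNKNOWN_MEMBER_ID = 25;
    static final short FENCED_MEMBER_EPOCH = 110;

    final String memberId;                        // kept across the rejoin
    int memberEpoch;                              // reset to 0 when fenced
    final Set<String> ownedPartitions = new HashSet<>();

    RejoinSketch(String memberId, int memberEpoch) {
        this.memberId = memberId;
        this.memberEpoch = memberEpoch;
    }

    // Apply the KIP-848 rule quoted above: abandon all partitions and
    // rejoin with the same member id and epoch 0.
    void onHeartbeatError(short errorCode) {
        if (errorCode == UNKNOWN_MEMBER_ID || errorCode == FENCED_MEMBER_EPOCH) {
            ownedPartitions.clear();
            memberEpoch = 0;
        }
    }
}
```

The next heartbeat after such an error then carries the same memberId with memberEpoch=0, which is the request shape visible in the log sample further down.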

However, the current broker implementation throws FENCED_MEMBER_EPOCH whenever 
the client-provided epoch differs from the current epoch, unless it is exactly 
the current epoch - 1.

Specifically this line: 
[https://github.com/apache/kafka/blob/e68781414e9bcbc1d7cd5c247433a13f8d0e2e6e/group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupMetadataManager.java#L1535]

If the current epoch is 13, and I reset to epoch 0, the conditional always 
throws FENCED_MEMBER_EPOCH.
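
The condition as described in this report can be sketched as follows. This is a simplification based only on the description here, not the broker's actual code; the class and method names are hypothetical, and the real check at the linked line may consider more than the two epochs.

```java
// Simplified model of the epoch check as described in this report.
// Hypothetical names; not the code in GroupMetadataManager.
final class EpochCheckSketch {
    // Returns true if a heartbeat carrying receivedEpoch would be answered
    // with FENCED_MEMBER_EPOCH when the coordinator is at currentEpoch.
    static boolean isFenced(int currentEpoch, int receivedEpoch) {
        if (receivedEpoch == currentEpoch) {
            return false;                         // epochs match, accepted
        }
        return receivedEpoch != currentEpoch - 1; // only the previous epoch is tolerated
    }

    public static void main(String[] args) {
        System.out.println(isFenced(13, 13)); // false
        System.out.println(isFenced(13, 12)); // false
        System.out.println(isFenced(13, 0));  // true: the rejoin is fenced
        System.out.println(isFenced(1, 0));   // false: 0 is the previous epoch
    }
}
```

The last case shows why the problem only appears once the group is past generation 1: at epoch 1, an epoch of 0 happens to equal the current epoch - 1.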

Attached are logs of this case; here is a sample of a single log line 
demonstrating the problem:
{code:java}
2025-05-02 15:23:09,304 
[data-plane-kafka-network-thread-3-ListenerName(PLAINTEXT)-PLAINTEXT-0] DEBUG 
kafka.request.logger - Completed 
request:{"isForwarded":false,"requestHeader":{"requestApiKey":68,"requestApiVersion":1,"correlationId":46,"clientId":"kgo","requestApiKeyName":"CONSUMER_GROUP_HEARTBEAT"},"request":{"groupId":"67660d2bfc7b197c91ff86623614522285c05c14b9f817fa99e6c105a2f54d7f","memberId":"uxNPFKnjF3OrkZIAghLN1Q==","memberEpoch":0,"instanceId":null,"rackId":null,"rebalanceTimeoutMs":60000,"subscribedTopicNames":["aed98f76851080d77b6098a03ea5ef088dabc21331462e44ed7ae5be463e2655"],"subscribedTopicRegex":null,"serverAssignor":"range","topicPartitions":[]},"response":{"throttleTimeMs":0,"errorCode":110,"errorMessage":"The
 consumer group member has a smaller member epoch (0) than the one known by the 
group coordinator (11). The member must abandon all its partitions and 
rejoin.","memberId":null,"memberEpoch":0,"heartbeatIntervalMs":0,"assignment":null},"connection":"127.0.0.1:9096-127.0.0.1:56686-0-292","totalTimeMs":0.801,"requestQueueTimeMs":0.159,"localTimeMs":0.106,"remoteTimeMs":0.315,"throttleTimeMs":0,"responseQueueTimeMs":0.066,"sendTimeMs":0.153,"securityProtocol":"PLAINTEXT","principal":"User:ANONYMOUS","listener":"PLAINTEXT","clientInformation":{"softwareName":"kgo","softwareVersion":"unknown"}}
{code}
The logs show the broker repeatedly responding with error code 110 
(FENCED_MEMBER_EPOCH) for 50s until, I'm assuming, some condition boots the 
member from the group; the next time the broker receives the request, it 
treats the member as new and the request succeeds.
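
For anyone who wants to measure that window in their own request logs, here is a rough sketch. It assumes each log entry sits on a single physical line (unlike the wrapped sample above) and starts with a timestamp in the format shown there; the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;

// Hypothetical helper: measures the span between the first and last
// CONSUMER_GROUP_HEARTBEAT responses that carry error code 110.
final class FencedWindowSketch {
    // Matches the timestamp prefix in the sample, e.g. "2025-05-02 15:23:09,304".
    private static final DateTimeFormatter TS =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");
    private static final int TS_LENGTH = 23;

    static Duration fencedWindow(List<String> lines) {
        LocalDateTime first = null;
        LocalDateTime last = null;
        for (String line : lines) {
            // Plain substring matching; good enough for a quick look at a log.
            boolean fencedHeartbeat =
                line.length() >= TS_LENGTH
                    && line.contains("\"requestApiKeyName\":\"CONSUMER_GROUP_HEARTBEAT\"")
                    && line.contains("\"errorCode\":110,");
            if (!fencedHeartbeat) {
                continue;
            }
            LocalDateTime ts = LocalDateTime.parse(line.substring(0, TS_LENGTH), TS);
            if (first == null) {
                first = ts;
            }
            last = ts;
        }
        return first == null ? Duration.ZERO : Duration.between(first, last);
    }
}
```

Feeding it the lines of a request log returns how long the member kept receiving error code 110 before a heartbeat was accepted.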

The first heartbeat is duplicated; I noticed that Kafka replies with 
FENCED_MEMBER_EPOCH _way too often_ if a heartbeat is duplicated, and I'm 
trying to see if it's possible to work around that. As an aside, between the 
fenced error happening _a lot_, this issue, and KAFKA-19222, I'm leaning 
toward not opting into KIP-848 by default until the broker implementation 
improves.

  was:
If a group is on generation > 1 and a member is fenced, the member cannot 
rejoin until the broker expires the member from the group.

KIP-848 says "Upon receiving the UNKNOWN_MEMBER_ID or FENCED_MEMBER_EPOCH 
error, the consumer abandon all its partitions and rejoins with the same member 
id and the epoch 0.".

However, the current implementation on the broker throws FENCED_LEADER_EPOCH if 
the client provided epoch, when not equal to the current epoch, is anything 
other than the current epoch - 1.

Specifically this line: 
https://github.com/apache/kafka/blob/e68781414e9bcbc1d7cd5c247433a13f8d0e2e6e/group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupMetadataManager.java#L1535

If the current epoch is 13, and I reset to epoch 0, the conditional always 
throws FENCED_LEADER_EPOCH.

Attached are logs of this case, here is a sample of a single log line 
demonstrating the problem:

{code}
2025-05-02 15:23:09,304 
[data-plane-kafka-network-thread-3-ListenerName(PLAINTEXT)-PLAINTEXT-0] DEBUG 
kafka.request.logger - Completed 
request:{"isForwarded":false,"requestHeader":{"requestApiKey":68,"requestApiVersion":1,"correlationId":46,"clientId":"kgo","requestApiKeyName":"CONSUMER_GROUP_HEARTBEAT"},"request":{"groupId":"67660d2bfc7b197c91ff86623614522285c05c14b9f817fa99e6c105a2f54d7f","memberId":"uxNPFKnjF3OrkZIAghLN1Q==","memberEpoch":0,"instanceId":null,"rackId":null,"rebalanceTimeoutMs":60000,"subscribedTopicNames":["aed98f76851080d77b6098a03ea5ef088dabc21331462e44ed7ae5be463e2655"],"subscribedTopicRegex":null,"serverAssignor":"range","topicPartitions":[]},"response":{"throttleTimeMs":0,"errorCode":110,"errorMessage":"The
 consumer group member has a smaller member epoch (0) than the one known by the 
group coordinator (11). The member must abandon all its partitions and 
rejoin.","memberId":null,"memberEpoch":0,"heartbeatIntervalMs":0,"assignment":null},"connection":"127.0.0.1:9096-127.0.0.1:56686-0-292","totalTimeMs":0.801,"requestQueueTimeMs":0.159,"localTimeMs":0.106,"remoteTimeMs":0.315,"throttleTimeMs":0,"responseQueueTimeMs":0.066,"sendTimeMs":0.153,"securityProtocol":"PLAINTEXT","principal":"User:ANONYMOUS","listener":"PLAINTEXT","clientInformation":{"softwareName":"kgo","softwareVersion":"unknown"}}
{code}

The logs show the broker continuously responding errcode 110 for 50s until, I'm 
assuming, some condition boots the member from the group, such that the next 
time the broker receives the request, the member is considered new and the 
request is successful.

The first heartbeat is duplicated; I noticed that Kafka replies with 
FENCED_LEADER_EPOCH _way too often_ if a heartbeat is duplicated, and I'm 
trying to see if it's possible to work around that. As an aside, between the 
fenced error happening _a lot_, this issue, and KAFKA-19222, I'm leaning to not 
opt into KIP-848 by default until the broker implementation improves.


> Members cannot rejoin with epoch=0 for KIP-848
> ----------------------------------------------
>
>                 Key: KAFKA-19233
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19233
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, consumer
>            Reporter: Travis Bischel
>            Priority: Major
>         Attachments: logs1
>
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
