[
https://issues.apache.org/jira/browse/KAFKA-19233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17951069#comment-17951069
]
Travis Bischel commented on KAFKA-19233:
----------------------------------------
I rewired the guts of the client to deliberately trigger the case where
responses are lost due to connections being cut. If needed, I can change the
client to kill the connection rather than just repeating the request.
> Line 2 : OK, but why a full HB after a successful response from the broker on
> Line 1?
The client never "received" the response. The response is lost. The client is
re-issuing a full heartbeat. My immediate question is: why is the client
already fenced?
> Line 3: Why is the client repeating "CONSUMER_GROUP_HEARTBEAT uxNP e10."? If
> it got FENCED_MEMBER_EPOCH on Line 2 from the broker, the client should
> release its assignment, reset the epoch to 0, and only then HB again to
> rejoin (so HB with 0, not 10)
Loss simulation.
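For reference, the client-side handling that KIP-848 prescribes (and that the quoted question describes) can be sketched roughly as below. This is a toy illustration with hypothetical types, not franz-go's actual code:

```go
package main

import "fmt"

// Hypothetical types for illustration only; not franz-go's API.
type heartbeatResp struct {
	errCode int32
}

const errFencedMemberEpoch = 110 // FENCED_MEMBER_EPOCH

type groupMember struct {
	memberID    string
	memberEpoch int32
	assignment  []int32 // assigned partitions
}

// handleHeartbeat sketches the KIP-848 contract: on FENCED_MEMBER_EPOCH the
// member abandons all its partitions, resets its epoch to 0, and heartbeats
// again with the same member ID to rejoin.
func (m *groupMember) handleHeartbeat(resp heartbeatResp) {
	if resp.errCode == errFencedMemberEpoch {
		m.assignment = nil // abandon all partitions
		m.memberEpoch = 0  // rejoin from epoch 0, keeping memberID
		return
	}
	// normal path: adopt the new epoch/assignment from the response
}

func main() {
	m := &groupMember{memberID: "uxNPFKnjF3OrkZIAghLN1Q==", memberEpoch: 10, assignment: []int32{0, 1}}
	m.handleHeartbeat(heartbeatResp{errCode: errFencedMemberEpoch})
	fmt.Println(m.memberEpoch, len(m.assignment)) // 0 0
}
```

The point of the bug report is that even a client behaving exactly like this cannot rejoin, because the broker fences the epoch-0 heartbeat.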
> Could you double check in this case, why is the fence error set on the
>broker?
I can't. I enabled every logger on the broker with
{{{}log4j.logger.kafka=DEBUG{}}}. There is no log related to the decisions the
broker makes when adding or removing members from the group. I'd be quite
happy for the problem to be in my properties file, but I saw nothing.
> Are these logs mixing several clients maybe?
The logs1 file attached to this issue contains all logs for the member ID
{{{}uxNPFKnjF3OrkZIAghLN1Q=={}}} in the cluster. I deleted all logs for other clients.
The fourth line in that file shows the client joining with epoch 0 and
receiving a fenced response:
{noformat}
2025-05-02 15:22:32,739
[data-plane-kafka-network-thread-3-ListenerName(PLAINTEXT)-PLAINTEXT-0] DEBUG
kafka.request.logger - Completed
request:{"isForwarded":false,"requestHeader":{"requestApiKey":68,"requestApiVersion":1,"correlationId":25,"clientId":"kgo","requestApiKeyName":"CONSUMER_GROUP_HEARTBEAT"},"request":{"groupId":"67660d2bfc7b197c91ff86623614522285c05c14b9f817fa99e6c105a2f54d7f","memberId":"uxNPFKnjF3OrkZIAghLN1Q==","memberEpoch":0,"instanceId":null,"rackId":null,"rebalanceTimeoutMs":60000,"subscribedTopicNames":["aed98f76851080d77b6098a03ea5ef088dabc21331462e44ed7ae5be463e2655"],"subscribedTopicRegex":null,"serverAssignor":"range","topicPartitions":[]},"response":{"throttleTimeMs":0,"errorCode":110,"errorMessage":"The
consumer group member has a smaller member epoch (0) than the one known by the
group coordinator (11). The member must abandon all its partitions and
rejoin.","memberId":null,"memberEpoch":0,"heartbeatIntervalMs":0,"assignment":null},"connection":"127.0.0.1:9096-127.0.0.1:56686-0-292","totalTimeMs":0.143,"requestQueueTimeMs":0.022,"localTimeMs":0.023,"remoteTimeMs":0.056,"throttleTimeMs":0,"responseQueueTimeMs":0.013,"sendTimeMs":0.027,"securityProtocol":"PLAINTEXT","principal":"User:ANONYMOUS","listener":"PLAINTEXT","clientInformation":{"softwareName":"kgo","softwareVersion":"unknown"}}
{noformat}
On the prior line, you can see the client being fenced at epoch 11.
That test demonstrates that the client releases its subscription and rejoins
with epoch 0 once it receives a member-fenced error. There is no test that
specifically simulates what happens when the first heartbeat response is lost.
I _suspect_ (I'm not positive) that if Confluent simulated some lost responses
in an integration test, there would be a _lot_ more fenced errors encountered
than expected.
If you're curious, this is the test I am running:
[https://github.com/twmb/franz-go/blob/master/pkg/kgo/helpers_test.go#L460-L557]
This is the code where I was ignoring the first response and reissuing the
request (i.e., response lost):
[https://github.com/twmb/franz-go/blob/master/pkg/kgo/consumer_group_848.go#L161-L172]
Some modifications are needed to trigger this issue; I have commented out the
loss simulation and plan to re-run it on future releases to decide whether or
not to opt into the next-gen balancer.
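The interaction between the lost response and the broker's epoch check can be modeled in a few lines. This is a toy sketch: {{fakeBroker}} is hypothetical and enforces only the epoch rule reported in this issue, not the real coordinator's Java logic:

```go
package main

import "fmt"

const errFencedMemberEpoch = 110 // FENCED_MEMBER_EPOCH

// fakeBroker models only the epoch bookkeeping relevant here; it is a stand-in
// for the group coordinator, not the actual implementation.
type fakeBroker struct {
	currentEpoch int32
}

// heartbeat bumps the member epoch on success, and fences any request whose
// epoch is neither the current epoch nor current-1 (the rule this issue
// reports the broker enforcing).
func (b *fakeBroker) heartbeat(epoch int32) (newEpoch int32, errCode int32) {
	if epoch != b.currentEpoch && epoch != b.currentEpoch-1 {
		return 0, errFencedMemberEpoch
	}
	b.currentEpoch++
	return b.currentEpoch, 0
}

func main() {
	b := &fakeBroker{currentEpoch: 10}

	// First heartbeat at epoch 10 succeeds, but the response is "lost":
	// the client never learns it is now on epoch 11.
	b.heartbeat(10)

	// Reissuing the identical request still passes (10 == 11-1), but once
	// the client is fenced and resets to epoch 0 per KIP-848, it can never
	// pass this check while the current epoch is > 1.
	_, err1 := b.heartbeat(10)
	_, err2 := b.heartbeat(0) // rejoin attempt per KIP-848
	fmt.Println(err1, err2)   // 0 110
}
```

In this model the epoch-0 rejoin is fenced forever, which matches the 50s of repeated errcode-110 responses in the attached logs until the broker expires the member.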
> Members cannot rejoin with epoch=0 for KIP-848
> ----------------------------------------------
>
> Key: KAFKA-19233
> URL: https://issues.apache.org/jira/browse/KAFKA-19233
> Project: Kafka
> Issue Type: Bug
> Components: clients, consumer
> Reporter: Travis Bischel
> Priority: Major
> Attachments: logs1
>
>
> If a group is on generation > 1 and a member is fenced, the member cannot
> rejoin until the broker expires the member from the group.
> KIP-848 says "Upon receiving the UNKNOWN_MEMBER_ID or FENCED_MEMBER_EPOCH
> error, the consumer abandon all its partitions and rejoins with the same
> member id and the epoch 0.".
> However, the current broker implementation throws FENCED_MEMBER_EPOCH
> whenever the client-provided epoch differs from the current epoch and is not
> exactly the current epoch - 1.
> Specifically this line:
> [https://github.com/apache/kafka/blob/e68781414e9bcbc1d7cd5c247433a13f8d0e2e6e/group-coordinator/src/main/java/org/apache/kafka/coordinator/group/GroupMetadataManager.java#L1535]
> If the current epoch is 13, and I reset to epoch 0, the conditional always
> throws FENCED_MEMBER_EPOCH.
> Attached are logs of this case, here is a sample of a single log line
> demonstrating the problem:
> {code:java}
> 2025-05-02 15:23:09,304
> [data-plane-kafka-network-thread-3-ListenerName(PLAINTEXT)-PLAINTEXT-0] DEBUG
> kafka.request.logger - Completed
> request:{"isForwarded":false,"requestHeader":{"requestApiKey":68,"requestApiVersion":1,"correlationId":46,"clientId":"kgo","requestApiKeyName":"CONSUMER_GROUP_HEARTBEAT"},"request":{"groupId":"67660d2bfc7b197c91ff86623614522285c05c14b9f817fa99e6c105a2f54d7f","memberId":"uxNPFKnjF3OrkZIAghLN1Q==","memberEpoch":0,"instanceId":null,"rackId":null,"rebalanceTimeoutMs":60000,"subscribedTopicNames":["aed98f76851080d77b6098a03ea5ef088dabc21331462e44ed7ae5be463e2655"],"subscribedTopicRegex":null,"serverAssignor":"range","topicPartitions":[]},"response":{"throttleTimeMs":0,"errorCode":110,"errorMessage":"The
> consumer group member has a smaller member epoch (0) than the one known by
> the group coordinator (11). The member must abandon all its partitions and
> rejoin.","memberId":null,"memberEpoch":0,"heartbeatIntervalMs":0,"assignment":null},"connection":"127.0.0.1:9096-127.0.0.1:56686-0-292","totalTimeMs":0.801,"requestQueueTimeMs":0.159,"localTimeMs":0.106,"remoteTimeMs":0.315,"throttleTimeMs":0,"responseQueueTimeMs":0.066,"sendTimeMs":0.153,"securityProtocol":"PLAINTEXT","principal":"User:ANONYMOUS","listener":"PLAINTEXT","clientInformation":{"softwareName":"kgo","softwareVersion":"unknown"}}
> {code}
> The logs show the broker continuously responding with error code 110 for 50s
> until, I'm assuming, some condition boots the member from the group, such
> that the next time the broker receives the request, the member is considered
> new and the request succeeds.
> The first heartbeat is duplicated; I noticed that Kafka replies with
> FENCED_MEMBER_EPOCH _way too often_ when a heartbeat is duplicated, and I'm
> trying to see if it's possible to work around that. As an aside, between the
> fenced error happening {_}a lot{_}, this issue, and KAFKA-19222, I'm leaning
> toward not opting into KIP-848 by default until the broker implementation
> improves.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)