[
https://issues.apache.org/jira/browse/KAFKA-18007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17946998#comment-17946998
]
Mickael Maison commented on KAFKA-18007:
----------------------------------------
I had a deployment hitting this same error. It was running MirrorMaker in
dedicated mode (started via connect-mirror-maker.sh), mirroring data from A to
B.
The B to A flow was disabled (B->A.enabled = false), but an instance of
MirrorCheckpointConnector with clientId B->A was still running and throwing
this exception. The trick to shut it down was to also explicitly disable the
heartbeat connector for B->A by setting B->A.emit.heartbeats.enabled=false in
the configuration.
This happened because emit.heartbeats.enabled is true by default, and in an
A->B flow the heartbeats connector actually produces data to A (it's
inverted from the other MirrorMaker connectors). This caused
MirrorCheckpointConnector to try running in the B->A flow, and because that
flow is disabled, it failed and logged the exception.
This behavior is described in
https://github.com/apache/kafka/blob/trunk/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorMakerConfig.java#L126-L130
In that method, a flow is still created even when it is explicitly disabled,
unless the heartbeats connector is also explicitly disabled.
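For illustration, the workaround in this comment could look like the following
fragment of the properties file passed to connect-mirror-maker.sh. This is a
minimal sketch, not the configuration of the affected deployment: the cluster
aliases A and B come from the comment, while the bootstrap addresses are
placeholders I made up.

{code}
# Cluster aliases as used in the comment; bootstrap addresses are hypothetical.
clusters = A, B
A.bootstrap.servers = a-broker:9092
B.bootstrap.servers = b-broker:9092

# Mirror A to B only.
A->B.enabled = true
B->A.enabled = false

# Disabling the flow alone is not enough: emit.heartbeats.enabled defaults to
# true, so the B->A flow is still created. Disable heartbeats for that pair
# explicitly as well.
B->A.emit.heartbeats.enabled = false
{code}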
> MirrorCheckpointConnector fails with “Timeout while loading consumer groups”
> after upgrading to Kafka 3.9.0
> -----------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-18007
> URL: https://issues.apache.org/jira/browse/KAFKA-18007
> Project: Kafka
> Issue Type: Bug
> Components: mirrormaker
> Affects Versions: 3.9.0
> Environment: - Kafka Version: Upgraded sequentially from 3.6.0 to
> 3.9.0
> - Clusters: Three clusters named A, B, and C
> - Clusters A and B mirror topics to cluster C using MirrorMaker 2
> - Number of Consumer Groups: Approximately 200
> - Number of Topics: Approximately 2000
> - Operating System: Ubuntu 20.04.5 LTS (GNU/Linux 5.4.0-135-generic x86_64)
> Reporter: Asker
> Priority: Major
>
> After upgrading our Kafka clusters from version 3.6.0 to 3.9.0, we started
> experiencing repeated errors with the MirrorCheckpointConnector in
> MirrorMaker 2. The connector fails with a RetriableException stating “Timeout
> while loading consumer groups.” This issue persists despite several attempts
> to resolve it.
> Error Message:
> {code:bash}
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: [2024-11-11 12:21:57,342] ERROR [Worker clientId=analytics-dev->app-dev, groupId=analytics-dev-mm2] Failed to reconfigure connector's tasks (MirrorCheckpointConnector), retrying after backoff. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:2195)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: org.apache.kafka.connect.errors.RetriableException: Timeout while loading consumer groups.
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.mirror.MirrorCheckpointConnector.taskConfigs(MirrorCheckpointConnector.java:138)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.Worker.connectorTaskConfigs(Worker.java:398)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.reconfigureConnector(DistributedHerder.java:2243)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.reconfigureConnectorTasksWithExponentialBackoffRetries(DistributedHerder.java:2183)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.lambda$null$47(DistributedHerder.java:2199)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.runRequest(DistributedHerder.java:2402)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.tick(DistributedHerder.java:498)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:383)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.lang.Thread.run(Thread.java:840)
> {code}
> Steps to Reproduce:
> 1. Upgrade Kafka clusters sequentially from 3.6.0 to 3.9.0.
> 2. Configure MirrorMaker 2 to mirror topics from clusters A and B to cluster
> C.
> 3. Start MirrorMaker 2.
> 4. Observe the logs for the MirrorCheckpointConnector.
> What We Tried:
> {*}Checked ACLs and Authentication{*}:
> - Ensured that the mirror_maker user has the necessary permissions and can
> authenticate successfully.
> - Verified that we could list consumer groups using kafka-consumer-groups.sh
> with the mirror_maker user.
> {*}Increased Timeouts{*}:
> - Increased admin.timeout.ms to 300000 (5 minutes) and even higher values.
> - Adjusted admin.request.timeout.ms and admin.retry.backoff.ms accordingly.
> {*}Enabled Detailed Logging{*}:
> - Set the logging level to DEBUG for org.apache.kafka.connect.mirror to gain
> more insights.
> - No additional information that could help resolve the issue was found.
> {*}Temporary Workarounds{*}:
> - Disabled emit.checkpoints.enabled and sync.group.offsets.enabled to
> prevent the MirrorCheckpointConnector from running.
> - This is not a viable long-term solution as we need to synchronize consumer
> group offsets.
> Resolution:
> Rolled Back to Kafka 3.8.1:
> - As a test, we downgraded our Kafka clusters back to version 3.8.1.
> - After the downgrade, the error disappeared, and the
> MirrorCheckpointConnector functioned correctly.
> - This suggests that the issue was introduced in version 3.9.0.
> Analysis:
> Possible Relation to KAFKA-17232:
> - We found the JIRA issue KAFKA-17232 titled “MirrorCheckpointConnector does
> not generate task configs if initial consumer group load times out.”
> - It appears that changes introduced in Kafka 3.9.0 related to this issue
> may have inadvertently caused our problem.
> - However, our clusters are not particularly large, and the initial consumer
> group load should not exceed the timeouts.
> Request:
> {*}Assistance in Resolving the Issue{*}:
> - Is there a known workaround or configuration change that can prevent this
> error in Kafka 3.9.0?
> - Could the changes made in KAFKA-17232 have unintentionally caused this
> problem?
> - Are there plans to address this issue in an upcoming release?
> {*}Guidance on Next Steps{*}:
> - Should we avoid upgrading to versions beyond 3.8.1 until this issue is
> resolved?
> - Is it advisable to apply any patches or pull requests manually?
> Thank you for your attention to this matter. Please let me know if I can
> provide any additional information to help resolve this issue.
> Best regards,
> Asker Kakhramanov