Chris Egerton created KAFKA-17155:
-------------------------------------
Summary: Redundant rebalances triggered after connector
creation/deletion and task config updates
Key: KAFKA-17155
URL: https://issues.apache.org/jira/browse/KAFKA-17155
Project: Kafka
Issue Type: Bug
Components: connect
Affects Versions: 3.8.0, 3.9.0
Reporter: Chris Egerton
KAFKA-17105 describes a scenario where a connector may be
unnecessarily restarted soon after it has been created.
Similarly, when any event occurs that sets the
[DistributedHerder.needsReconfigRebalance
flag|https://github.com/apache/kafka/blob/a66a59f427b30611175fd029d86832d00aa5aabd/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/DistributedHerder.java#L215]
to true (at the time of writing, these are the detection of a new connector,
the removal of an existing connector, and the detection of new task
configurations, regardless of whether the connector already had task
configurations), it is possible that a rebalance has already started because
another worker has detected the same change. In that case,
{{needsReconfigRebalance}} will still be set to {{true}} even after that
rebalance has taken place, and the worker will force an unnecessary second
rebalance.
We might consider changing the "needs reconfig rebalance" field into a
"reconfig rebalance threshold" field, which contains the latest offset of a
record consumed from the config topic that warrants a rebalance. When possibly
performing rebalances based on this field, the worker can check if the offset
in the assignment given out by the leader during the most recent rebalance is
greater than or equal to this threshold, and if so, choose not to force a
rebalance.
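A minimal sketch of the threshold idea above, written as a standalone class rather than as a patch to DistributedHerder. The class name, field name, method names, and the -1 sentinel are all hypothetical and do not exist in the Kafka code base; the sketch only models the comparison between the threshold and the offset carried by the most recent assignment.

```java
/**
 * Hypothetical stand-in for the proposed "reconfig rebalance threshold" field.
 * Nothing here is part of Kafka Connect; it only illustrates the comparison.
 */
public class ReconfigRebalanceThreshold {
    // Assumed sentinel meaning "no pending change that warrants a rebalance".
    private static final long NO_THRESHOLD = -1L;

    // Latest config topic offset of a record that warrants a rebalance.
    private long threshold = NO_THRESHOLD;

    /** Called when a consumed config record (new/removed connector, new task configs) warrants a rebalance. */
    public synchronized void onRebalanceWorthyRecord(long configOffset) {
        threshold = Math.max(threshold, configOffset);
    }

    /**
     * Decides whether the worker should still force a rebalance.
     *
     * @param assignmentOffset the config offset in the assignment given out by
     *                         the leader during the most recent rebalance
     */
    public synchronized boolean shouldForceRebalance(long assignmentOffset) {
        if (threshold == NO_THRESHOLD) {
            return false; // nothing pending
        }
        if (assignmentOffset >= threshold) {
            // The most recent rebalance already covered the change; clear the
            // threshold instead of forcing a redundant second rebalance.
            threshold = NO_THRESHOLD;
            return false;
        }
        return true; // the last assignment predates the change
    }

    public static void main(String[] args) {
        ReconfigRebalanceThreshold state = new ReconfigRebalanceThreshold();
        state.onRebalanceWorthyRecord(42L);
        // Another worker already triggered a rebalance that included offset 42.
        System.out.println(state.shouldForceRebalance(42L));
    }
}
```

With the current boolean flag, the equivalent of the final call would still report that a rebalance is needed, because the flag carries no information about which config offset the completed rebalance already accounted for.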
This has caused issues in some tests, but may be a benign race condition
that does not have practical consequences in the real world. We may not want to
address this (especially with an approach that increases the complexity of the
code base and comes with risk of regression) until/unless someone states that
it's affected them outside of Kafka Connect unit tests.