[
https://issues.apache.org/jira/browse/KAFKA-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17854091#comment-17854091
]
Chris Egerton edited comment on KAFKA-16931 at 6/11/24 3:06 PM:
----------------------------------------------------------------
First, one small clarification: task restarts do not result in zombie fencings
unless no successful zombie fencing has taken place yet for the current
generation of task configs. They do require an unconditional REST request to
the leader to check on whether that fencing has taken place yet, and to perform
one if it hasn't.

With that out of the way, a KIP would definitely be required if we wanted to
add new configurations related to retries. We could add some hard-coded retry
logic for now, which IMO wouldn't require a KIP. The tricky part either way
would be striking a balance between resiliency to transient failures (which the
current design certainly lacks) and surfacing non-retriable errors to users in
an easily-accessible manner (which, despite its shortcomings, the current
design does fairly well).

If we do decide to add new configuration properties, perhaps they could apply
for all inter-worker REST requests (including requests to the {{PUT
/connectors/<connector>/fence}} endpoint, the {{PUT
/connectors/<connector>/tasks}} endpoint, and user-initiated requests that are
forwarded from one worker to another)? It's also a bit of a sharp edge that,
right now, failures to forward task configs to the leader are retried
infinitely with nothing but a ton of {{ERROR}}-level log messages to
indicate any sign of unhealthiness, and it could be useful to allow the
connector to fail at some point instead.
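
To make the hard-coded option concrete, here is a rough sketch of bounded retries with exponential backoff that retries only failures classified as transient and lets everything else propagate right away. It is not taken from the Connect codebase: {{FenceRetrySketch}}, {{TransientRequestException}}, {{callWithRetries}}, and the attempt/backoff values are all hypothetical names and numbers chosen for illustration, and how a real REST failure would be classified as transient is left open.

```java
import java.util.function.Supplier;

final class FenceRetrySketch {

    /** Hypothetical marker for failures worth retrying, e.g. a dropped connection. */
    static class TransientRequestException extends RuntimeException {
        TransientRequestException(String message) {
            super(message);
        }
    }

    /**
     * Runs the request up to maxAttempts times, doubling the backoff after each
     * transient failure. Any other exception propagates on the first attempt.
     */
    static <T> T callWithRetries(Supplier<T> request, int maxAttempts, long initialBackoffMs) {
        long backoffMs = initialBackoffMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return request.get();
            } catch (TransientRequestException e) {
                if (attempt >= maxAttempts) {
                    throw e; // retries exhausted: surface the failure instead of looping forever
                }
                try {
                    Thread.sleep(backoffMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while backing off", ie);
                }
                backoffMs *= 2;
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Stand-in for the REST request to the leader: fails twice, then succeeds.
        String result = callWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new TransientRequestException("simulated network glitch");
            }
            return "fenced";
        }, 3, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Capping the attempts is what keeps non-retriable and persistent errors visible: once the limit is hit, the exception reaches the caller, so the task can still fail as it does today rather than retrying silently.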
> Transient REST failures to forward fenceZombie requests leave Connect Tasks
> in FAILED state
> -------------------------------------------------------------------------------------------
>
> Key: KAFKA-16931
> URL: https://issues.apache.org/jira/browse/KAFKA-16931
> Project: Kafka
> Issue Type: Bug
> Components: connect
> Reporter: Edoardo Comar
> Priority: Major
>
> When Kafka Connect runs in exactly_once mode, a task restart will fence
> possible zombie tasks.
> This is achieved by forwarding the request to the leader worker using the REST
> protocol.
> At scale, in distributed mode, an HTTPS request may occasionally fail because
> of a networking glitch, reconfiguration, etc.
> Currently there is no attempt to retry the REST request; the task is left in
> a FAILED state and requires an external restart (with the REST API).
> Would this issue require a small KIP to introduce configuration entries to
> limit the number of retries, backoff times, etc.?
>