[
https://issues.apache.org/jira/browse/KAFKA-20113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Christo Lolov updated KAFKA-20113:
----------------------------------
Fix Version/s: 4.3.0
(was: 4.2.0)
> Add Configurable Retry Parameters for Status Backing Store
> ----------------------------------------------------------
>
> Key: KAFKA-20113
> URL: https://issues.apache.org/jira/browse/KAFKA-20113
> Project: Kafka
> Issue Type: New Feature
> Components: connect
> Reporter: Said BOUDJELDA
> Assignee: Said BOUDJELDA
> Priority: Major
> Labels: configuration, connect, improvement, reliability
> Fix For: 4.3.0
>
>
> Implement configurable retry parameters for the +KafkaStatusBackingStore+ to
> address the TODO comment "retry more gracefully and not forever" and provide
> operators with control over retry behavior during transient failures.
> h3. Problem Statement
>
> KafkaStatusBackingStore currently retries status updates indefinitely when
> encountering retriable exceptions. This behavior is problematic because:
> # *Infinite retry loops* can cause the worker to become unresponsive during
> extended Kafka broker outages
> # *No visibility* into retry behavior - operators cannot tune retry
> parameters based on their environment
> # *Resource exhaustion* - indefinite retries can consume threads and memory
> during prolonged failures
> # *No graceful degradation* - the system continues retrying without bound
> rather than failing fast when appropriate
> A TODO comment in the codebase ({{{}// TODO: retry more gracefully and not
> forever{}}}) explicitly acknowledges this issue needs addressing.
> h3. Proposed Solution
> Add four new configuration properties under the {{status.storage.}} prefix to
> control retry behavior:
>
> ||Property||Type||Default||Description||
> |{{status.storage.retry.max.retries}}|INT|5|Maximum number of retry attempts
> before giving up|
> |{{status.storage.retry.initial.backoff.ms}}|LONG|300|Initial backoff delay
> in milliseconds|
> |{{status.storage.retry.max.backoff.ms}}|LONG|10000|Maximum backoff delay cap
> in milliseconds|
> |{{status.storage.retry.backoff.multiplier}}|DOUBLE|2.0|Multiplier applied to
> backoff after each attempt|
> The retry mechanism uses *exponential backoff with jitter* to prevent
> thundering herd problems during cluster recovery.
> h4. Behavior
> * Retries occur only for exceptions marked as {{RetriableException}}
> * After exhausting {{{}max.retries{}}}, the operation logs an error and
> terminates gracefully
> * All retry attempts are logged at WARN level with attempt count and delay
> information
> * Non-retriable exceptions fail immediately without retry
> *
> h3. Benefits
> # *Predictable failure modes* - Workers eventually give up and surface
> errors instead of hanging
> # *Operator control* - Tune retry behavior based on environment
> characteristics
> # *Better observability* - Clear logging of retry attempts and outcomes
> # *Backward compatible* - Default values maintain similar behavior to
> current implementation
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)