[ 
https://issues.apache.org/jira/browse/KAFKA-20113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christo Lolov updated KAFKA-20113:
----------------------------------
    Fix Version/s: 4.3.0
                       (was: 4.2.0)

> Add Configurable Retry Parameters for Status Backing Store
> ----------------------------------------------------------
>
>                 Key: KAFKA-20113
>                 URL: https://issues.apache.org/jira/browse/KAFKA-20113
>             Project: Kafka
>          Issue Type: New Feature
>          Components: connect
>            Reporter: Said BOUDJELDA
>            Assignee: Said BOUDJELDA
>            Priority: Major
>              Labels: configuration, connect, improvement, reliability
>             Fix For: 4.3.0
>
>
> Implement configurable retry parameters for the +KafkaStatusBackingStore+ to 
> address the TODO comment "retry more gracefully and not forever" and provide 
> operators with control over retry behavior during transient failures.
> h3. Problem Statement
>  
> KafkaStatusBackingStore currently retries status updates indefinitely when 
> encountering retriable exceptions. This behavior is problematic because:
>  # *Infinite retry loops* can cause the worker to become unresponsive during 
> extended Kafka broker outages
>  # *No visibility* into retry behavior - operators cannot tune retry 
> parameters based on their environment
>  # *Resource exhaustion* - indefinite retries can consume threads and memory 
> during prolonged failures
>  # *No graceful degradation* - the system continues retrying without bound 
> rather than failing fast when appropriate
> A TODO comment in the codebase ({{// TODO: retry more gracefully and not 
> forever}}) explicitly acknowledges that this issue needs addressing.
> h3. Proposed Solution
> Add four new configuration properties under the {{status.storage.}} prefix to 
> control retry behavior:
>  
> ||Property||Type||Default||Description||
> |{{status.storage.retry.max.retries}}|INT|5|Maximum number of retry attempts before giving up|
> |{{status.storage.retry.initial.backoff.ms}}|LONG|300|Initial backoff delay in milliseconds|
> |{{status.storage.retry.max.backoff.ms}}|LONG|10000|Maximum backoff delay cap in milliseconds|
> |{{status.storage.retry.backoff.multiplier}}|DOUBLE|2.0|Multiplier applied to backoff after each attempt|
> The retry mechanism uses *exponential backoff with jitter* to prevent 
> thundering herd problems during cluster recovery.
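To make the interaction of the four properties concrete, here is a minimal Java sketch of how a delay could be derived from them. It is not the patch attached to this issue: the class name `StatusRetryBackoff`, its method names, and the symmetric jitter factor (for example 0.2 for roughly +/-20%) are assumptions made for illustration, since the issue does not say how jitter is applied.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper for illustration; not the actual KAFKA-20113 implementation. */
final class StatusRetryBackoff {
    private final long initialBackoffMs; // status.storage.retry.initial.backoff.ms
    private final long maxBackoffMs;     // status.storage.retry.max.backoff.ms
    private final double multiplier;     // status.storage.retry.backoff.multiplier
    private final double jitter;         // assumed fraction, e.g. 0.2; not named in the issue

    StatusRetryBackoff(long initialBackoffMs, long maxBackoffMs,
                       double multiplier, double jitter) {
        this.initialBackoffMs = initialBackoffMs;
        this.maxBackoffMs = maxBackoffMs;
        this.multiplier = multiplier;
        this.jitter = jitter;
    }

    /** Delay before retry number {@code attempt} (0-based), capped, without jitter. */
    long baseDelayMs(int attempt) {
        double raw = initialBackoffMs * Math.pow(multiplier, attempt);
        return (long) Math.min(raw, (double) maxBackoffMs);
    }

    /** Same delay with random jitter applied, never exceeding the configured cap. */
    long delayMs(int attempt) {
        long base = baseDelayMs(attempt);
        double factor = jitter > 0
                ? 1.0 + ThreadLocalRandom.current().nextDouble(-jitter, jitter)
                : 1.0;
        return Math.min(maxBackoffMs, Math.round(base * factor));
    }
}
```

With the proposed defaults (300 ms initial, multiplier 2.0, 10000 ms cap), the un-jittered delays for five retries would be 300, 600, 1200, 2400 and 4800 ms; the cap only takes effect from the seventh retry onward, which matters if `max.retries` is raised.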
> h4. Behavior
>  * Retries occur only for exceptions marked as {{RetriableException}}
>  * After exhausting {{status.storage.retry.max.retries}}, the operation logs 
> an error and terminates gracefully
>  * All retry attempts are logged at WARN level with attempt count and delay 
> information
>  * Non-retriable exceptions fail immediately without retry
>
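The behavior rules above can be sketched as a bounded retry loop. This is only an illustration under stated assumptions: `BoundedRetry` and its nested `RetriableException` are hypothetical stand-ins (the latter for `org.apache.kafka.common.errors.RetriableException`, so the example compiles without Kafka on the classpath), jitter is left out to keep the delays deterministic, and rethrowing after the last attempt is one possible reading of "terminates gracefully".

```java
import java.util.function.LongConsumer;
import java.util.function.Supplier;

/** Hypothetical sketch of the described retry behavior; not the actual patch. */
final class BoundedRetry {
    /** Stand-in for Kafka's RetriableException so the sketch is self-contained. */
    static class RetriableException extends RuntimeException {
        RetriableException(String message) { super(message); }
    }

    static <T> T call(Supplier<T> operation, int maxRetries, long initialBackoffMs,
                      long maxBackoffMs, double multiplier, LongConsumer sleeper) {
        int retries = 0;
        while (true) {
            try {
                return operation.get();
            } catch (RetriableException e) {
                // Only retriable failures are caught; anything else propagates at once.
                if (retries >= maxRetries) {
                    System.err.printf("ERROR giving up after %d retries: %s%n",
                            retries, e.getMessage());
                    throw e; // assumption: surface the failure to the caller
                }
                long delay = (long) Math.min(
                        initialBackoffMs * Math.pow(multiplier, retries),
                        (double) maxBackoffMs);
                retries++;
                System.err.printf("WARN retry %d/%d in %d ms: %s%n",
                        retries, maxRetries, delay, e.getMessage());
                sleeper.accept(delay); // injected so callers and tests control waiting
            }
        }
    }
}
```

Passing the sleeper in as a `LongConsumer` keeps the loop testable without real waiting; a caller that wants actual pauses could supply a lambda that wraps `Thread.sleep` and handles interruption.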
> h3. Benefits
>  # *Predictable failure modes* - Workers eventually give up and surface 
> errors instead of hanging
>  # *Operator control* - Tune retry behavior based on environment 
> characteristics
>  # *Better observability* - Clear logging of retry attempts and outcomes
>  # *Backward compatible* - Default values maintain similar behavior to 
> current implementation



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
