[
https://issues.apache.org/jira/browse/KAFKA-17455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17878249#comment-17878249
]
Colt McNealy commented on KAFKA-17455:
--------------------------------------
This is probably a problem in the Producer rather than in Streams itself.
I am confused about what causes this. The Kafka quota documentation states:
> Byte-rate and thread utilization are measured over multiple small windows
> (e.g. 30 windows of 1 second each) in order to detect and correct quota
> violations quickly. Typically, having large measurement windows (for e.g. 10
> windows of 30 seconds each) leads to large bursts of traffic followed by long
> delays which is not great in terms of user experience.
Our application was configured with `commit.interval.ms` of 100ms, and a
producer timeout of 60 seconds (you can see that from the stacktrace).
Once the broker started throttling, the Streams Commit seemed to hang.
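For context, the relevant configuration was roughly the following. This is a minimal sketch; the exact producer timeout property is my assumption — the 60000ms in the stacktrace matches the default producer `max.block.ms`, which bounds how long transactional requests such as `AddOffsetsToTxn` are awaited:

```java
import java.util.Properties;

public class StreamsQuotaRepro {
    public static Properties streamsConfig() {
        Properties props = new Properties();
        // Streams commits (and, under EOS, commits the transaction) every 100 ms.
        props.setProperty("commit.interval.ms", "100");
        // EOS processing, as in the report.
        props.setProperty("processing.guarantee", "exactly_once_v2");
        // Assumption: the 60-second timeout in the stacktrace corresponds to the
        // producer's max.block.ms (default 60000 ms), passed through Streams'
        // "producer." config prefix.
        props.setProperty("producer.max.block.ms", "60000");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(streamsConfig());
    }
}
```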
I am confused as to _why_ this would happen, given that:
* Our app only exceeded the quota by 5-10%.
* Quota enforcement is supposedly gradual, per the documentation quoted above.
* We configured Streams to commit every 100ms, which implies that the producer
send buffer also flushes every 100ms, so any throttling should be experienced
gradually rather than all at once.
* The producer request timeout was 60 seconds, and the `TimeoutException` means
that the full 60-second timeout was exhausted. That is unlikely to result from
normal throttling given that we were only barely over the quota.
This leads me to think there's a bug in the retry or throttling mechanism in
the producer's transaction-commit path.
cc [~eduwerc] and [~mjsax]
> `TaskCorruptedException` After Client Quota Throttling
> ------------------------------------------------------
>
> Key: KAFKA-17455
> URL: https://issues.apache.org/jira/browse/KAFKA-17455
> Project: Kafka
> Issue Type: Bug
> Components: clients, streams
> Affects Versions: 3.8.0
> Reporter: Colt McNealy
> Priority: Major
>
> When running a Kafka Streams EOS app that goes slightly above a configured
> user quota, we can reliably reproduce `TaskCorruptedException`s after
> throttling. This is the case even with an application that goes only 5-10%
> above the configured quota.
>
> The root cause is a `TimeoutException` encountered in
> `TaskExecutor.commitOffsetsOrTransaction`.
>
> Stacktrace provided below:
>
> ```
> 19:45:28 ERROR [KAFKA] TaskExecutor - stream-thread [basic-tls-0-core-StreamThread-2] Committing task(s) 1_2 failed.
> org.apache.kafka.common.errors.TimeoutException: Timeout expired after 60000ms while awaiting AddOffsetsToTxn
> 19:45:28 WARN [KAFKA] StreamThread - stream-thread [basic-tls-0-core-StreamThread-2] Detected the states of tasks [1_2] are corrupted. Will close the task as dirty and re-create and bootstrap from scratch.
> org.apache.kafka.streams.errors.TaskCorruptedException: Tasks [1_2] are corrupted and hence need to be re-initialized
>     at org.apache.kafka.streams.processor.internals.TaskExecutor.commitOffsetsOrTransaction(TaskExecutor.java:249) ~[server.jar:?]
>     at org.apache.kafka.streams.processor.internals.TaskExecutor.commitTasksAndMaybeUpdateCommittableOffsets(TaskExecutor.java:154) ~[server.jar:?]
>     at org.apache.kafka.streams.processor.internals.TaskManager.commitTasksAndMaybeUpdateCommittableOffsets(TaskManager.java:1915) ~[server.jar:?]
>     at org.apache.kafka.streams.processor.internals.TaskManager.commit(TaskManager.java:1882) ~[server.jar:?]
>     at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:1384) ~[server.jar:?]
>     at org.apache.kafka.streams.processor.internals.StreamThread.runOnceWithoutProcessingThreads(StreamThread.java:1033) ~[server.jar:?]
>     at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:711) [server.jar:?]
>     at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:670) [server.jar:?]
> ```
--
This message was sent by Atlassian Jira
(v8.20.10#820010)