Brandon Barron created KAFKA-20398:
--------------------------------------
Summary: Memory leak when stream threads are replaced
Key: KAFKA-20398
URL: https://issues.apache.org/jira/browse/KAFKA-20398
Project: Kafka
Issue Type: Bug
Components: streams
Affects Versions: 4.0.0
Reporter: Brandon Barron
Some objects appear not to be released when stream threads are replaced.
This didn't occur before 4.x, but in testing multiple client versions from
4.x onward, we've seen memory fill up when stream threads are frequently
replaced from within our StreamsUncaughtExceptionHandler.
In the heap dump extracted from one of our testing instances that went OOM (5
stream threads, 128M heap):
{noformat}
444 instances of org.apache.kafka.clients.consumer.KafkaConsumer, loaded by
jdk.internal.loader.ClassLoaders$AppClassLoader @ 0xfacdc630 occupy 67.01 MB
(54.63%) bytes.
222 instances of org.apache.kafka.streams.processor.internals.TaskManager,
loaded by jdk.internal.loader.ClassLoaders$AppClassLoader @ 0xfacdc630 occupy
29.70 MB (24.21%) bytes.{noformat}
These instance counts line up pretty closely with the number of thread
replacements between application start and termination due to OOM. The
TaskManager instance count is almost 1:1 with the number of thread replacements.
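To put the heap dump figures in per-replacement terms, here is a rough sketch of the arithmetic. It assumes the two reported sizes do not overlap and that each thread replacement leaks exactly one TaskManager. Neither assumption is confirmed by the dump itself, and the class and method names are illustrative only.

```java
// Back-of-the-envelope estimate from the heap dump figures quoted above.
// Assumptions (not confirmed by the dump): the two reported sizes do not
// overlap, and each thread replacement leaks exactly one TaskManager.
final class LeakEstimate {

    /** Average number of instances of some class per leaked TaskManager. */
    static double perTaskManager(long instances, long taskManagers) {
        return (double) instances / taskManagers;
    }

    /** Average reported size of a single instance, in KiB. */
    static double avgKiB(double totalMb, long instances) {
        return totalMb * 1024.0 / instances;
    }

    /** Heap (in MB) attributed to one thread replacement. */
    static double mbPerReplacement(double totalMb, long replacements) {
        return totalMb / replacements;
    }

    public static void main(String[] args) {
        long consumers = 444;
        long taskManagers = 222;
        double consumerMb = 67.01;
        double taskManagerMb = 29.70;

        System.out.printf("KafkaConsumers per TaskManager: %.1f%n",
                perTaskManager(consumers, taskManagers));
        System.out.printf("Avg KiB per KafkaConsumer: %.1f%n",
                avgKiB(consumerMb, consumers));
        System.out.printf("Avg KiB per TaskManager: %.1f%n",
                avgKiB(taskManagerMb, taskManagers));
        System.out.printf("MB leaked per replacement: %.3f%n",
                mbPerReplacement(consumerMb + taskManagerMb, taskManagers));
    }
}
```

The 2:1 ratio of KafkaConsumer to TaskManager instances would be consistent with each replaced stream thread leaving behind two consumers, though the dump alone does not prove which consumers those are.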
In one test, we triggered thread replacements in a slow loop for a few hours,
then allowed it to run normally after that point. During the thread replacement
loops, the memory filled up fairly quickly. After processing went back to
normal, memory levels sat close to the heap limit for multiple days without
running into OOM, but also without reclaiming any heap space built up from
those first few hours.
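One way to watch for this pattern from inside the JVM is to sample heap usage right after requesting a GC, so that garbage that is merely uncollected is not mistaken for retained memory. The sketch below uses only the standard java.lang.management API. The HeapWatch class and its method names are hypothetical and not part of the reproducer.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper for sampling heap usage; not part of the reproducer.
final class HeapWatch {
    private static final MemoryMXBean MEMORY = ManagementFactory.getMemoryMXBean();

    /** Heap bytes in use after requesting a GC (the JVM treats this as a hint). */
    static long usedAfterGc() {
        MEMORY.gc(); // same effect as System.gc()
        return MEMORY.getHeapMemoryUsage().getUsed();
    }

    /** Fraction of the max heap currently in use, or -1 if max is undefined. */
    static double usedFraction() {
        MemoryUsage heap = MEMORY.getHeapMemoryUsage();
        long max = heap.getMax();
        return max > 0 ? (double) heap.getUsed() / max : -1.0;
    }

    public static void main(String[] args) throws InterruptedException {
        // Take a few samples; a real test would log these over hours.
        for (int i = 0; i < 3; i++) {
            System.out.printf("used after GC: %d bytes, fraction of max: %.3f%n",
                    usedAfterGc(), usedFraction());
            Thread.sleep(200);
        }
    }
}
```

If the post-GC figure keeps climbing during the replacement loop and stays flat afterwards, that matches the behavior described above: the memory is still reachable rather than waiting to be collected.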
{*}Minimal example for reproducing{*}:
[https://github.com/bwbarron/kstreams-mem-thread-replace]
*Summary of my results using this minimal example ({+}using 128M heap{+}):*
Client version 3.9.2
* Left running for 24 hours with no OOM
Client version 4.0.x
* Typically takes about 7 minutes of replace thread loop to reach OOM
Client version 4.1.x & 4.2.0
* Thread replacements seem to happen much slower than older versions (looks
to be about 45sec in between replacements)
* Took roughly 3.5 hours to get OOM, likely due to slower loop execution
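As a sanity check on the "slower loop" explanation, the timings above can be turned into an approximate replacement count. This is only a sketch. It takes the 45 second interval and 3.5 hour runtime at face value, and the comparison with the 222 TaskManager instances assumes the heap dump came from a comparable 128M run, which may not be the case.

```java
// Rough arithmetic on the timings reported above; all inputs are approximate.
final class ReplacementEstimate {

    /** Number of replacements that fit in the given runtime. */
    static long replacements(double hours, double secondsBetween) {
        return Math.round(hours * 3600.0 / secondsBetween);
    }

    /** Implied seconds between replacements for a given runtime and count. */
    static double secondsBetween(double minutes, long replacements) {
        return minutes * 60.0 / replacements;
    }

    public static void main(String[] args) {
        // 4.1.x / 4.2.0: ~3.5 hours at ~45 s per replacement.
        System.out.println(replacements(3.5, 45.0)); // 280

        // 4.0.x: ~7 minutes to OOM. Assumption: roughly the same number of
        // replacements (222, the TaskManager count in the heap dump) fills
        // the heap.
        System.out.printf("%.2f s between replacements%n", secondsBetween(7.0, 222));
    }
}
```

About 280 replacements is the same order of magnitude as the 222 leaked TaskManager instances, which fits a leak of roughly fixed size per replacement rather than a smaller leak in 4.1.x and 4.2.0.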
--
This message was sent by Atlassian Jira
(v8.20.10#820010)