Brandon Barron created KAFKA-20398:
--------------------------------------

             Summary: Memory leak when stream threads are replaced
                 Key: KAFKA-20398
                 URL: https://issues.apache.org/jira/browse/KAFKA-20398
             Project: Kafka
          Issue Type: Bug
          Components: streams
    Affects Versions: 4.0.0
            Reporter: Brandon Barron


It appears that some objects are never released when stream threads are 
replaced. This did not occur before 4.x, but across the multiple client 
versions we have tested since 4.x, memory fills up when stream threads are 
frequently replaced from within our StreamsUncaughtExceptionHandler.
The heap dump extracted from one of our testing instances that went OOM (5 
stream threads, 128M heap) shows:

 
{noformat}
444 instances of org.apache.kafka.clients.consumer.KafkaConsumer, loaded by 
jdk.internal.loader.ClassLoaders$AppClassLoader @ 0xfacdc630 occupy 67.01 MB 
(54.63%) bytes.

222 instances of org.apache.kafka.streams.processor.internals.TaskManager, 
loaded by jdk.internal.loader.ClassLoaders$AppClassLoader @ 0xfacdc630 occupy 
29.70 MB (24.21%) bytes.{noformat}
 
These class counts line up pretty closely with the number of thread 
replacements between application start and its termination due to OOM. The 
TaskManager class count is almost 1:1 with the number of thread replacements.
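 
As a rough sanity check on the figures above, the sketch below works out the average retained size per instance and the ratio of KafkaConsumer to TaskManager instances. The class and method names are hypothetical, and the constants are copied from the heap dump summary. The 2:1 ratio would be consistent with each stream thread owning two consumers (for example a main consumer and a restore consumer), but that reading is an assumption and not something the dump itself states.

```java
import java.util.Locale;

/** Back-of-the-envelope arithmetic on the heap dump figures (hypothetical helper). */
final class HeapDumpMath {
    // Figures copied from the heap dump summary in this report.
    static final int CONSUMER_INSTANCES = 444;
    static final double CONSUMER_MB = 67.01;
    static final int TASK_MANAGER_INSTANCES = 222;
    static final double TASK_MANAGER_MB = 29.70;

    /** Average retained size per instance in KB, taking 1 MB = 1024 KB. */
    static double avgKbPerInstance(double totalMb, int instances) {
        return totalMb * 1024.0 / instances;
    }

    /** How many instances of one class exist per instance of another. */
    static double ratio(int numerator, int denominator) {
        return (double) numerator / denominator;
    }

    public static void main(String[] args) {
        System.out.println(String.format(Locale.ROOT,
                "KafkaConsumer per TaskManager: %.1f",
                ratio(CONSUMER_INSTANCES, TASK_MANAGER_INSTANCES)));
        System.out.println(String.format(Locale.ROOT,
                "Average KB per KafkaConsumer: %.1f",
                avgKbPerInstance(CONSUMER_MB, CONSUMER_INSTANCES)));
        System.out.println(String.format(Locale.ROOT,
                "Average KB per TaskManager: %.1f",
                avgKbPerInstance(TASK_MANAGER_MB, TASK_MANAGER_INSTANCES)));
    }
}
```

If the TaskManager count really is about 1:1 with thread replacements, these averages suggest that each replacement leaves a few hundred KB behind in these two classes alone.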
 
In one test, we triggered thread replacements in a slow loop for a few hours, 
then allowed it to run normally after that point. During the thread replacement 
loops, the memory filled up fairly quickly. After processing went back to 
normal, memory levels sat close to the heap limit for multiple days without 
running into OOM, but also without reclaiming any heap space built up from 
those first few hours.
 
{*}Minimal example for reproducing{*}: 
[https://github.com/bwbarron/kstreams-mem-thread-replace]
 
*Summary of my results using this minimal example ({+}using 128M heap{+}):*

Client version 3.9.2
 * Left running for 24 hours with no OOM

Client version 4.0.x
 * Typically takes about 7 minutes of replace thread loop to reach OOM

Client version 4.1.x & 4.2.0
 * Thread replacements seem to happen much slower than older versions (looks 
to be about 45 sec in between replacements)
 * Took roughly 3.5 hours to get OOM, likely due to slower loop execution
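
To put the 4.1.x / 4.2.0 timing in perspective, the sketch below estimates how many thread replacements fit into the run and how much heap each one could have consumed at most. It assumes a steady 45-second interval for the whole 3.5 hours and treats the full 128M heap as available to the leak. Both are simplifications, and the class and method names are hypothetical.

```java
import java.util.Locale;

/** Rough estimate of replacements before OOM (hypothetical helper). */
final class ReplacementLoopMath {

    /** Number of replacements that fit in the run at a fixed interval. */
    static long estimatedReplacements(double hoursToOom, double secondsBetween) {
        return Math.round(hoursToOom * 3600.0 / secondsBetween);
    }

    /** Upper bound on heap consumed per replacement, in MB. */
    static double heapMbPerReplacement(double heapMb, long replacements) {
        return heapMb / replacements;
    }

    public static void main(String[] args) {
        // Inputs taken from the 4.1.x / 4.2.0 results above.
        long replacements = estimatedReplacements(3.5, 45.0);
        double mbEach = heapMbPerReplacement(128.0, replacements);
        System.out.println(String.format(Locale.ROOT,
                "Estimated replacements before OOM: %d", replacements));
        System.out.println(String.format(Locale.ROOT,
                "Upper bound on heap per replacement: %.2f MB", mbEach));
    }
}
```

Under these assumptions the run covers about 280 replacements, which is the same order of magnitude as the 222 TaskManager instances seen in the heap dump.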



