No, you are incorrect. The same rules against pooling apply to Go as well. Only VERY expensive objects should be pooled - most bulk memory allocations are short-lived bump allocations.
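As a sketch of what "pool the expensive object, not the thread" can look like for the 8 KB buffer case discussed below in the thread, a small bounded pool works the same whether the threads are pooled or created per task. This is an illustrative class, not from Aerospike or any other library; the pool and buffer sizes are arbitrary.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch: a bounded pool of reusable 8 KB buffers.
// Borrowing blocks when all buffers are in use, which also acts as backpressure.
public class BufferPool {
    private final BlockingQueue<byte[]> pool;

    public BufferPool(int buffers, int size) {
        pool = new ArrayBlockingQueue<>(buffers);
        for (int i = 0; i < buffers; i++) {
            pool.add(new byte[size]);
        }
    }

    public byte[] acquire() throws InterruptedException {
        return pool.take();   // blocks if all buffers are borrowed
    }

    public void release(byte[] buf) {
        pool.offer(buf);      // return the buffer for reuse (dropped if pool is full)
    }

    public static void main(String[] args) throws InterruptedException {
        BufferPool bp = new BufferPool(200, 8 * 1024);
        byte[] buf = bp.acquire();
        System.out.println(buf.length);
        bp.release(buf);
    }
}
```

Unlike a ThreadLocal cache, the pool's memory footprint is fixed at 200 buffers no matter how many short-lived virtual threads run tasks.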
> On Jan 23, 2026, at 11:55 PM, Jianbin Chen <[email protected]> wrote:
>
> Hi Francesco,
>
> I modified my example as follows:
>
> ```java
> public static void main(String[] args) throws InterruptedException {
>     Executor executor = Executors.newVirtualThreadPerTaskExecutor();
>     Executor executor2 = new ThreadPoolExecutor(200, Integer.MAX_VALUE, 0L,
>             java.util.concurrent.TimeUnit.SECONDS,
>             new SynchronousQueue<>(), Thread.ofVirtual().factory());
>     for (int i = 0; i < 10100; i++) {
>         executor.execute(() -> {
>             try {
>                 Thread.sleep(100);
>             } catch (InterruptedException e) {
>                 throw new RuntimeException(e);
>             }
>         });
>         executor2.execute(() -> {
>             try {
>                 Thread.sleep(100);
>             } catch (InterruptedException e) {
>                 throw new RuntimeException(e);
>             }
>         });
>     }
>     Thread.sleep(5000);
>     long start = System.currentTimeMillis();
>     CountDownLatch countDownLatch = new CountDownLatch(5000000);
>     for (int i = 0; i < 5000000; i++) {
>         executor.execute(() -> {
>             try {
>                 Thread.sleep(100);
>                 countDownLatch.countDown();
>             } catch (InterruptedException e) {
>                 throw new RuntimeException(e);
>             }
>         });
>     }
>     countDownLatch.await();
>     System.out.println("thread time: " + (System.currentTimeMillis() - start) + " ms");
>     start = System.currentTimeMillis();
>     CountDownLatch countDownLatch2 = new CountDownLatch(5000000);
>     for (int i = 0; i < 5000000; i++) {
>         executor2.execute(() -> {
>             try {
>                 Thread.sleep(100);
>                 countDownLatch2.countDown();
>             } catch (InterruptedException e) {
>                 throw new RuntimeException(e);
>             }
>         });
>     }
>     countDownLatch2.await();
>     System.out.println("thread pool time: " + (System.currentTimeMillis() - start) + " ms");
> }
> ```
>
> I constructed the Executor directly with Executors.newVirtualThreadPerTaskExecutor(); however, the run results still show that the pooled virtual-thread behavior outperforms the non-pooled virtual threads.
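The semaphore-based backpressure that Robert suggests later in the thread can be sketched roughly like this: it caps in-flight tasks at 200 without pooling the virtual threads themselves. The loop count, sleep duration, and permit count here are arbitrary, chosen only to keep the sketch fast to run.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

// Sketch: limit the number of tasks in progress with a Semaphore,
// while still creating one fresh virtual thread per task.
public class CappedSubmit {
    public static void main(String[] args) throws InterruptedException {
        Semaphore permits = new Semaphore(200);
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                permits.acquire();                // blocks once 200 tasks are in flight
                executor.execute(() -> {
                    try {
                        Thread.sleep(1);          // simulate I/O wait
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        permits.release();
                    }
                });
            }
        } // close() waits for all submitted tasks to finish
        System.out.println("done");
    }
}
```

The working set stays at roughly 200 * WSS as with a fixed pool, but each task still gets its own thread, so per-thread state such as ThreadLocal is not carried between tasks.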
>
> Francesco Nigro <[email protected]> wrote on Fri, Jan 23, 2026 at 23:39:
>> I would say, yes:
>> https://github.com/openjdk/jdk21/blob/890adb6410dab4606a4f26a942aed02fb2f55387/src/java.base/share/classes/java/lang/ThreadBuilders.java#L317
>> unless the fix will be backported - surely @Andrew Haley <[email protected]> or @Alan Bateman <[email protected]> knows
>>
>> Jianbin Chen <[email protected]> wrote on Fri, Jan 23, 2026 at 16:32:
>> > Hi Francesco,
>> >
>> > I'd like to know if there's a similar issue in JDK 21?
>> >
>> > Best Regards.
>> > Jianbin Chen, github-id: funky-eyes
>> >
>> > Francesco Nigro <[email protected]> wrote on Fri, Jan 23, 2026 at 23:14:
>> >> In the original code snippet I see named (with a counter) VThreads, so, be aware of https://bugs.openjdk.org/browse/JDK-8372410
>> >>
>> >> Jianbin Chen <[email protected]> wrote on Fri, Jan 23, 2026 at 15:52:
>> >>> I'm sorry - I forgot to mention the machine I used for the load test. My server has 2 cores and 4 GB RAM, and the JVM heap was set to 2880m. Under my test load (about 20,000 QPS), with non-pooled virtual threads you generate at least 20,000 × 8 KB = ~156 MB of byte[] allocations per second just from that 8 KB buffer; that doesn't include other object allocations. With a 2880 MB heap this allocation rate already forces very frequent GC, and frequent GC raises CPU usage, which in turn significantly increases average response time and p99/p999 latency.
>> >>>
>> >>> Pooling is usually introduced to solve performance issues - object pools and connection pools exist to quickly reuse cached resources and improve performance.
>> >>> So pooling virtual threads also yields obvious benefits, especially for memory-constrained, I/O-bound applications (gateways, proxies, etc.) that are sensitive to latency.
>> >>>
>> >>> Best Regards.
>> >>> Jianbin Chen, github-id: funky-eyes
>> >>>
>> >>> Robert Engels <[email protected]> wrote on Fri, Jan 23, 2026 at 22:20:
>> >>>> I understand. I was trying to explain how you can avoid thread locals and still maintain the performance. It's unlikely that allocating an 8 KB buffer is a performance bottleneck in a real program if the task is not CPU bound (depending on the granularity of your tasks) - but 2M tasks running simultaneously would require 16 GB of memory, not including the stacks.
>> >>>>
>> >>>> You cannot simply use the thread-per-task model without an understanding of the CPU, I/O, and memory footprints of your tasks, and then configure appropriately.
>> >>>>
>> >>>> On Jan 23, 2026, at 8:10 AM, Jianbin Chen <[email protected]> wrote:
>> >>>>
>> >>>> I'm sorry, Robert - perhaps I didn't explain my example clearly enough. Here's the code in question:
>> >>>>
>> >>>> ```java
>> >>>> Executor executor2 = new ThreadPoolExecutor(
>> >>>>         200,
>> >>>>         Integer.MAX_VALUE,
>> >>>>         0L,
>> >>>>         java.util.concurrent.TimeUnit.SECONDS,
>> >>>>         new SynchronousQueue<>(),
>> >>>>         Thread.ofVirtual().name("test-threadpool-", 1).factory()
>> >>>> );
>> >>>> ```
>> >>>>
>> >>>> In this example, the pooled virtual threads don't implement any backpressure mechanism; they simply maintain a core pool of 200 virtual threads. Given that the queue is a `SynchronousQueue` and the maximum pool size is set to `Integer.MAX_VALUE`, once the concurrent tasks exceed 200, its behavior becomes identical to that of non-pooled virtual threads.
>> >>>> From my perspective, this example demonstrates that the benefits of pooling virtual threads outweigh those of creating a new virtual thread for every single task. In I/O-bound scenarios, the virtual threads are directly reused rather than being recreated each time, and the memory footprint of virtual threads is far smaller than that of platform threads (whose stack size is controlled by the `-Xss` flag). Additionally, with pooled virtual threads, the 8 KB `byte[]` cache I mentioned earlier (stored in a `ThreadLocal`) can also be reused, which further reduces overall memory usage - wouldn't you agree?
>> >>>>
>> >>>> Best Regards.
>> >>>> Jianbin Chen, github-id: funky-eyes
>> >>>>
>> >>>> Robert Engels <[email protected]> wrote on Fri, Jan 23, 2026 at 21:52:
>> >>>>> Because VTs are so efficient to create, without any backpressure they will all be created and running at essentially the same time (dramatically raising the amount of memory in use) - versus with a pool of size N you will have at most N running at once. In a REAL WORLD application there are often external limiters (like the number of TCP connections) that provide a limit.
>> >>>>>
>> >>>>> If your tasks are purely CPU bound you should probably be using a capped thread pool of platform threads, as it makes no sense to have more threads than available cores.
>> >>>>>
>> >>>>> On Jan 23, 2026, at 7:42 AM, Jianbin Chen <[email protected]> wrote:
>> >>>>>
>> >>>>> The question is why I need to use a semaphore to control the number of concurrently running tasks.
>> >>>>> In my particular example, the goal is simply to keep the concurrency level the same across different thread pool implementations so I can fairly compare which one completes all the tasks faster. This isn't solely about memory consumption - purely from a **performance** perspective (e.g., total throughput or wall-clock time to finish the workload), the same number of concurrent tasks completes noticeably faster when using pooled virtual threads.
>> >>>>>
>> >>>>> My email probably didn't explain this clearly enough. In reality, I have two main questions:
>> >>>>>
>> >>>>> 1. When a third-party library uses `ThreadLocal` as a cache/pool (e.g., to hold expensive reusable objects like connections, formatters, or parsers), is switching to a **pooled virtual thread executor** the only viable solution - assuming we cannot modify the third-party library code?
>> >>>>>
>> >>>>> 2. When running the exact same number of concurrent tasks, pooled virtual threads deliver better performance.
>> >>>>>
>> >>>>> Both questions point toward the same conclusion: for an application originally built around a traditional platform thread pool, after upgrading to JDK 21/25, moving to a **pooled virtual threads** approach is generally superior to simply using non-pooled (unbounded) virtual threads.
>> >>>>>
>> >>>>> If any part of this reasoning or conclusion is mistaken, I would really appreciate being corrected - thank you very much in advance for any feedback or different experiences you can share!
>> >>>>>
>> >>>>> Best Regards.
>> >>>>> Jianbin Chen, github-id: funky-eyes
>> >>>>>
>> >>>>> robert engels <[email protected]> wrote on Fri, Jan 23, 2026 at 20:58:
>> >>>>>> Exactly, this is your problem.
>> >>>>>> The total number of tasks will all be running at once in the thread-per-task model.
>> >>>>>>
>> >>>>>> On Jan 23, 2026, at 6:49 AM, Jianbin Chen <[email protected]> wrote:
>> >>>>>>
>> >>>>>> Hi Robert,
>> >>>>>>
>> >>>>>> Thank you, but I'm a bit confused. In the example above, I only set the core pool size to 200 virtual threads, but for the specific test case we're talking about, the concurrency isn't actually being limited by the pool size at all. Since the maximum thread count is Integer.MAX_VALUE and it's using a SynchronousQueue, tasks are handed off immediately and a new thread gets created to run them right away anyway.
>> >>>>>>
>> >>>>>> Best Regards.
>> >>>>>> Jianbin Chen, github-id: funky-eyes
>> >>>>>>
>> >>>>>> robert engels <[email protected]> wrote on Fri, Jan 23, 2026 at 20:28:
>> >>>>>>> Try using a semaphore to limit the maximum number of tasks in progress at any one time - that is what is causing your memory spike. Think of it this way: since VT threads are so cheap to create, you are essentially creating them all at once - making the working set size equal to the maximum. So you have N * WSS, whereas in the other case you have POOLSIZE * WSS.
>> >>>>>>>
>> >>>>>>> On Jan 23, 2026, at 4:14 AM, Jianbin Chen <[email protected]> wrote:
>> >>>>>>>
>> >>>>>>> Hi Alan,
>> >>>>>>>
>> >>>>>>> Thanks for your reply and for mentioning JEP 444. I've gone through the guidance in JEP 444 and have some understanding of it - which is exactly why I'm feeling a bit puzzled in practice and would really like to hear your thoughts.
>> >>>>>>>
>> >>>>>>> Background - ThreadLocal example (Aerospike)
>> >>>>>>>
>> >>>>>>> ```java
>> >>>>>>> private static final ThreadLocal<byte[]> BufferThreadLocal = new ThreadLocal<byte[]>() {
>> >>>>>>>     @Override
>> >>>>>>>     protected byte[] initialValue() {
>> >>>>>>>         return new byte[DefaultBufferSize];
>> >>>>>>>     }
>> >>>>>>> };
>> >>>>>>> ```
>> >>>>>>>
>> >>>>>>> This Aerospike code allocates a default 8 KB byte[] whenever a new thread is created and stores it in a ThreadLocal for per-thread caching.
>> >>>>>>>
>> >>>>>>> My concern
>> >>>>>>> - With a traditional platform-thread pool, those ThreadLocal byte[] instances are effectively reused because threads are long-lived and pooled.
>> >>>>>>> - If we switch to creating a brand-new virtual thread per task (no pooling), each virtual thread gets its own fresh ThreadLocal byte[], which leads to many short-lived 8 KB allocations.
>> >>>>>>> - That raises allocation rate and GC pressure (despite collectors like ZGC), because ThreadLocal caches aren't reused when threads are ephemeral.
>> >>>>>>>
>> >>>>>>> So my question is: for applications originally designed around platform-thread pools, wouldn't partially pooling virtual threads be beneficial? For example, Tomcat's default max threads is 200 - if I keep a pool of 200 virtual threads, then when load exceeds that core size, a SynchronousQueue will naturally cause new virtual threads to be created on demand. This seems to preserve the behavior that ThreadLocal-based libraries expect, without losing the ability to expand under spikes. Since virtual threads are very lightweight, pooling a reasonable number (e.g., 200) seems to have negligible memory downside while retaining ThreadLocal cache effectiveness.
>> >>>>>>>
>> >>>>>>> Empirical test I ran
>> >>>>>>> (I ran a microbenchmark comparing an unpooled per-task virtual-thread executor and a ThreadPoolExecutor that keeps 200 core virtual threads.)
>> >>>>>>>
>> >>>>>>> ```java
>> >>>>>>> public static void main(String[] args) throws InterruptedException {
>> >>>>>>>     Executor executor = Executors.newThreadPerTaskExecutor(Thread.ofVirtual().name("test-", 1).factory());
>> >>>>>>>     Executor executor2 = new ThreadPoolExecutor(
>> >>>>>>>             200,
>> >>>>>>>             Integer.MAX_VALUE,
>> >>>>>>>             0L,
>> >>>>>>>             java.util.concurrent.TimeUnit.SECONDS,
>> >>>>>>>             new SynchronousQueue<>(),
>> >>>>>>>             Thread.ofVirtual().name("test-threadpool-", 1).factory()
>> >>>>>>>     );
>> >>>>>>>
>> >>>>>>>     // Warm-up
>> >>>>>>>     for (int i = 0; i < 10100; i++) {
>> >>>>>>>         executor.execute(() -> {
>> >>>>>>>             // simulate I/O wait
>> >>>>>>>             try { Thread.sleep(100); } catch (InterruptedException e) { throw new RuntimeException(e); }
>> >>>>>>>         });
>> >>>>>>>         executor2.execute(() -> {
>> >>>>>>>             // simulate I/O wait
>> >>>>>>>             try { Thread.sleep(100); } catch (InterruptedException e) { throw new RuntimeException(e); }
>> >>>>>>>         });
>> >>>>>>>     }
>> >>>>>>>
>> >>>>>>>     // Ensure JIT + warm-up complete
>> >>>>>>>     Thread.sleep(5000);
>> >>>>>>>
>> >>>>>>>     long start = System.currentTimeMillis();
>> >>>>>>>     CountDownLatch countDownLatch = new CountDownLatch(50000);
>> >>>>>>>     for (int i = 0; i < 50000; i++) {
>> >>>>>>>         executor.execute(() -> {
>> >>>>>>>             try { Thread.sleep(100); countDownLatch.countDown(); } catch (InterruptedException e) { throw new RuntimeException(e); }
>> >>>>>>>         });
>> >>>>>>>     }
>> >>>>>>>     countDownLatch.await();
>> >>>>>>>     System.out.println("thread time: " + (System.currentTimeMillis() - start) + " ms");
>> >>>>>>>
>> >>>>>>>     start = System.currentTimeMillis();
>> >>>>>>>     CountDownLatch countDownLatch2 = new CountDownLatch(50000);
>> >>>>>>>     for (int i = 0; i < 50000; i++) {
>> >>>>>>>         executor2.execute(() -> {
>> >>>>>>>             try { Thread.sleep(100); countDownLatch2.countDown(); } catch (InterruptedException e) { throw new RuntimeException(e); }
>> >>>>>>>         });
>> >>>>>>>     }
>> >>>>>>>     countDownLatch2.await();
>> >>>>>>>     System.out.println("thread pool time: " + (System.currentTimeMillis() - start) + " ms");
>> >>>>>>> }
>> >>>>>>> ```
>> >>>>>>>
>> >>>>>>> Result summary
>> >>>>>>> - In my runs, the pooled virtual-thread executor (executor2) performed better than the unpooled per-task virtual-thread executor.
>> >>>>>>> - Even when I increased load by 10x or 100x, the pooled virtual-thread executor still showed better performance.
>> >>>>>>> - In realistic workloads, it seems pooling some virtual threads reduces allocation/GC overhead and improves throughput compared to strictly unpooled virtual threads.
>> >>>>>>>
>> >>>>>>> Final thought / request for feedback
>> >>>>>>> - From my perspective, for systems originally tuned for platform-thread pools, partially pooling virtual threads seems to have no obvious downside and can restore the ThreadLocal cache effectiveness used by many third-party libraries.
>> >>>>>>> - If I've misunderstood the JEP 444 recommendations, virtual-thread semantics, or ThreadLocal behavior, please point out what I'm missing. I'd appreciate your guidance.
>> >>>>>>>
>> >>>>>>> Best Regards.
>> >>>>>>> Jianbin Chen, github-id: funky-eyes
>> >>>>>>>
>> >>>>>>> Alan Bateman <[email protected]> wrote on Fri, Jan 23, 2026 at 17:27:
>> >>>>>>>> On 23/01/2026 07:30, Jianbin Chen wrote:
>> >>>>>>>> > :
>> >>>>>>>> >
>> >>>>>>>> > So my question is:
>> >>>>>>>> >
>> >>>>>>>> > **In scenarios where third-party libraries heavily rely on ThreadLocal for caching / buffering (and we cannot change those libraries to use object pools instead), is explicitly pooling virtual threads (using a ThreadPoolExecutor with a virtual thread factory) considered a recommended / acceptable workaround?**
>> >>>>>>>> >
>> >>>>>>>> > Or are there better / more idiomatic ways to handle this kind of compatibility issue with legacy ThreadLocal-based libraries when migrating to virtual threads?
>> >>>>>>>> >
>> >>>>>>>> > I have already opened a related discussion in the Dubbo project (since Dubbo is one of the libraries affected in our stack):
>> >>>>>>>> >
>> >>>>>>>> > https://github.com/apache/dubbo/issues/16042
>> >>>>>>>> >
>> >>>>>>>> > Would love to hear your thoughts - especially from people who have experience running large-scale virtual-thread-based services with mixed third-party dependencies.
>> >>>>>>>>
>> >>>>>>>> The guidelines we put in JEP 444 [1] are to not pool virtual threads and to avoid caching costly resources in thread locals. Virtual threads support thread locals of course, but that is not useful when some library is looking to share a costly resource between tasks that run on the same thread in a thread pool.
>> >>>>>>>>
>> >>>>>>>> I don't know anything about Aerospike, but working with the maintainers of that library to re-work its buffer management seems like the right course of action here. Your mail says "byte buffers". If this is ByteBuffer, it might be that they are caching direct buffers, as they are expensive to create (and managed by the GC). Maybe they could look at using MemorySegment (it's easy to get a ByteBuffer view of a memory segment) and allocate from an arena that better matches the lifecycle.
>> >>>>>>>>
>> >>>>>>>> Hopefully others will share their experiences with migration, as it is indeed challenging to migrate code developed for thread pools to work efficiently on virtual threads, where there is a 1-1 relationship between the task to execute and the thread.
>> >>>>>>>>
>> >>>>>>>> -Alan
>> >>>>>>>>
>> >>>>>>>> [1] https://openjdk.org/jeps/444#Thread-local-variables
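A minimal sketch of the MemorySegment/arena idea Alan mentions, assuming JDK 22+ where the Foreign Function & Memory API is final (it is a preview feature in JDK 21). The buffer size and values are illustrative only:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;

// Sketch: allocate a buffer from an arena whose lifetime matches the task,
// and hand libraries a ByteBuffer view of it.
public class ArenaBuffer {
    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {      // deterministic lifetime
            MemorySegment segment = arena.allocate(8 * 1024);
            ByteBuffer buf = segment.asByteBuffer();  // zero-copy view for ByteBuffer APIs
            buf.putInt(0, 42);
            System.out.println(buf.getInt(0));
        } // memory is released here, with no GC pressure from per-task buffers
    }
}
```

Because the arena is closed when the task ends, each short-lived virtual thread pays for its buffer deterministically instead of leaving 8 KB byte[] garbage for the collector.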
