No, you are incorrect. The same rules against pooling apply to Go as well. Only VERY expensive objects should be pooled - most bulk memory allocations are short-lived bump allocations.
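As a sketch of what "pool the expensive object, not the thread" can look like for the 8 KB buffer case discussed below in the thread, a small bounded pool works the same whether the threads are pooled or created per task. This is an illustrative class, not from Aerospike or any other library; the pool and buffer sizes are arbitrary.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch: a bounded pool of reusable 8 KB buffers.
// Borrowing blocks when all buffers are in use, which also acts as backpressure.
public class BufferPool {
    private final BlockingQueue<byte[]> pool;

    public BufferPool(int buffers, int size) {
        pool = new ArrayBlockingQueue<>(buffers);
        for (int i = 0; i < buffers; i++) {
            pool.add(new byte[size]);
        }
    }

    public byte[] acquire() throws InterruptedException {
        return pool.take();   // blocks if all buffers are borrowed
    }

    public void release(byte[] buf) {
        pool.offer(buf);      // return the buffer for reuse (dropped if pool is full)
    }

    public static void main(String[] args) throws InterruptedException {
        BufferPool bp = new BufferPool(200, 8 * 1024);
        byte[] buf = bp.acquire();
        System.out.println(buf.length);
        bp.release(buf);
    }
}
```

Unlike a ThreadLocal cache, the pool's memory footprint is fixed at 200 buffers no matter how many short-lived virtual threads run tasks.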
> On Jan 23, 2026, at 11:55 PM, Jianbin Chen <[email protected]> wrote:
>
> Hi Francesco,
>
> I modified my example as follows:
>
> ```java
> public static void main(String[] args) throws InterruptedException {
>     Executor executor = Executors.newVirtualThreadPerTaskExecutor();
>     Executor executor2 = new ThreadPoolExecutor(200, Integer.MAX_VALUE, 0L,
>             java.util.concurrent.TimeUnit.SECONDS,
>             new SynchronousQueue<>(), Thread.ofVirtual().factory());
>     for (int i = 0; i < 10100; i++) {
>         executor.execute(() -> {
>             try {
>                 Thread.sleep(100);
>             } catch (InterruptedException e) {
>                 throw new RuntimeException(e);
>             }
>         });
>         executor2.execute(() -> {
>             try {
>                 Thread.sleep(100);
>             } catch (InterruptedException e) {
>                 throw new RuntimeException(e);
>             }
>         });
>     }
>     Thread.sleep(5000);
>     long start = System.currentTimeMillis();
>     CountDownLatch countDownLatch = new CountDownLatch(5000000);
>     for (int i = 0; i < 5000000; i++) {
>         executor.execute(() -> {
>             try {
>                 Thread.sleep(100);
>                 countDownLatch.countDown();
>             } catch (InterruptedException e) {
>                 throw new RuntimeException(e);
>             }
>         });
>     }
>     countDownLatch.await();
>     System.out.println("thread time: " + (System.currentTimeMillis() - start) + " ms");
>     start = System.currentTimeMillis();
>     CountDownLatch countDownLatch2 = new CountDownLatch(5000000);
>     for (int i = 0; i < 5000000; i++) {
>         executor2.execute(() -> {
>             try {
>                 Thread.sleep(100);
>                 countDownLatch2.countDown();
>             } catch (InterruptedException e) {
>                 throw new RuntimeException(e);
>             }
>         });
>     }
>     countDownLatch2.await();
>     System.out.println("thread pool time: " + (System.currentTimeMillis() - start) + " ms");
> }
> ```
>
> I constructed the Executor directly with Executors.newVirtualThreadPerTaskExecutor(); however, the run results still show that the pooled virtual-thread behavior outperforms the non-pooled virtual threads.
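The semaphore-based backpressure that Robert suggests later in the thread can be sketched roughly like this: it caps in-flight tasks at 200 without pooling the virtual threads themselves. The loop count, sleep duration, and permit count here are arbitrary, chosen only to keep the sketch fast to run.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

// Sketch: limit the number of tasks in progress with a Semaphore,
// while still creating one fresh virtual thread per task.
public class CappedSubmit {
    public static void main(String[] args) throws InterruptedException {
        Semaphore permits = new Semaphore(200);
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 10_000; i++) {
                permits.acquire();                // blocks once 200 tasks are in flight
                executor.execute(() -> {
                    try {
                        Thread.sleep(1);          // simulate I/O wait
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        permits.release();
                    }
                });
            }
        } // close() waits for all submitted tasks to finish
        System.out.println("done");
    }
}
```

The working set stays at roughly 200 * WSS as with a fixed pool, but each task still gets its own thread, so per-thread state such as ThreadLocal is not carried between tasks.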
>
> Francesco Nigro <[email protected]> wrote on Fri, Jan 23, 2026 at 23:39:
>> I would say, yes:
>> https://github.com/openjdk/jdk21/blob/890adb6410dab4606a4f26a942aed02fb2f55387/src/java.base/share/classes/java/lang/ThreadBuilders.java#L317
>> unless the fix will be backported - surely @Andrew Haley <[email protected]> or @Alan Bateman <[email protected]> knows
>>
>> Jianbin Chen <[email protected]> wrote on Fri, Jan 23, 2026 at 16:32:
>> > Hi Francesco,
>> >
>> > I'd like to know if there's a similar issue in JDK 21?
>> >
>> > Best Regards.
>> > Jianbin Chen, github-id: funky-eyes
>> >
>> > Francesco Nigro <[email protected]> wrote on Fri, Jan 23, 2026 at 23:14:
>> >> In the original code snippet I see named (with a counter) VThreads, so, be aware of https://bugs.openjdk.org/browse/JDK-8372410
>> >>
>> >> Jianbin Chen <[email protected]> wrote on Fri, Jan 23, 2026 at 15:52:
>> >>> I'm sorry - I forgot to mention the machine I used for the load test. My server has 2 cores and 4 GB RAM, and the JVM heap was set to 2880m. Under my test load (about 20,000 QPS), with non-pooled virtual threads you generate at least 20,000 × 8 KB = ~156 MB of byte[] allocations per second just from that 8 KB buffer; that doesn't include other object allocations. With a 2880 MB heap this allocation rate already forces very frequent GC, and frequent GC raises CPU usage, which in turn significantly increases average response time and p99/p999 latency.
>> >>>
>> >>> Pooling is usually introduced to solve performance issues - object pools and connection pools exist to quickly reuse cached resources and improve performance.
>> >>> So pooling virtual threads also yields obvious benefits, especially for memory-constrained, I/O-bound applications (gateways, proxies, etc.) that are sensitive to latency.
>> >>>
>> >>> Best Regards.
>> >>> Jianbin Chen, github-id: funky-eyes
>> >>>
>> >>> Robert Engels <[email protected]> wrote on Fri, Jan 23, 2026 at 22:20:
>> >>>> I understand. I was trying to explain how you can avoid thread locals and still maintain the performance. It's unlikely that allocating an 8 KB buffer is a performance bottleneck in a real program if the task is not CPU bound (depending on the granularity of your tasks) - but 2M tasks running simultaneously would require 16 GB of memory, not including the stacks.
>> >>>>
>> >>>> You cannot simply use the thread-per-task model without an understanding of the CPU, I/O, and memory footprints of your tasks, and then configure appropriately.
>> >>>>
>> >>>> On Jan 23, 2026, at 8:10 AM, Jianbin Chen <[email protected]> wrote:
>> >>>>
>> >>>> I'm sorry, Robert - perhaps I didn't explain my example clearly enough. Here's the code in question:
>> >>>>
>> >>>> ```java
>> >>>> Executor executor2 = new ThreadPoolExecutor(
>> >>>>         200,
>> >>>>         Integer.MAX_VALUE,
>> >>>>         0L,
>> >>>>         java.util.concurrent.TimeUnit.SECONDS,
>> >>>>         new SynchronousQueue<>(),
>> >>>>         Thread.ofVirtual().name("test-threadpool-", 1).factory()
>> >>>> );
>> >>>> ```
>> >>>>
>> >>>> In this example, the pooled virtual threads don't implement any backpressure mechanism; they simply maintain a core pool of 200 virtual threads. Given that the queue is a `SynchronousQueue` and the maximum pool size is set to `Integer.MAX_VALUE`, once the concurrent tasks exceed 200, its behavior becomes identical to that of non-pooled virtual threads.
>> >>>> From my perspective, this example demonstrates that the benefits of pooling virtual threads outweigh those of creating a new virtual thread for every single task. In I/O-bound scenarios, the virtual threads are directly reused rather than being recreated each time, and the memory footprint of virtual threads is far smaller than that of platform threads (whose stack size is controlled by the `-Xss` flag). Additionally, with pooled virtual threads, the 8 KB `byte[]` cache I mentioned earlier (stored in a `ThreadLocal`) can also be reused, which further reduces overall memory usage - wouldn't you agree?
>> >>>>
>> >>>> Best Regards.
>> >>>> Jianbin Chen, github-id: funky-eyes
>> >>>>
>> >>>> Robert Engels <[email protected]> wrote on Fri, Jan 23, 2026 at 21:52:
>> >>>>> Because VTs are so efficient to create, without any backpressure they will all be created and running at essentially the same time (dramatically raising the amount of memory in use) - versus with a pool of size N you will have at most N running at once. In a REAL WORLD application there are often external limiters (like the number of TCP connections) that provide a limit.
>> >>>>>
>> >>>>> If your tasks are purely CPU bound you should probably be using a capped thread pool of platform threads, as it makes no sense to have more threads than available cores.
>> >>>>>
>> >>>>> On Jan 23, 2026, at 7:42 AM, Jianbin Chen <[email protected]> wrote:
>> >>>>>
>> >>>>> The question is why I need to use a semaphore to control the number of concurrently running tasks.
>> >>>>> In my particular example, the goal is simply to keep the concurrency level the same across different thread pool implementations so I can fairly compare which one completes all the tasks faster. This isn't solely about memory consumption - purely from a **performance** perspective (e.g., total throughput or wall-clock time to finish the workload), the same number of concurrent tasks completes noticeably faster when using pooled virtual threads.
>> >>>>>
>> >>>>> My email probably didn't explain this clearly enough. In reality, I have two main questions:
>> >>>>>
>> >>>>> 1. When a third-party library uses `ThreadLocal` as a cache/pool (e.g., to hold expensive reusable objects like connections, formatters, or parsers), is switching to a **pooled virtual thread executor** the only viable solution - assuming we cannot modify the third-party library code?
>> >>>>>
>> >>>>> 2. When running the exact same number of concurrent tasks, pooled virtual threads deliver better performance.
>> >>>>>
>> >>>>> Both questions point toward the same conclusion: for an application originally built around a traditional platform thread pool, after upgrading to JDK 21/25, moving to a **pooled virtual threads** approach is generally superior to simply using non-pooled (unbounded) virtual threads.
>> >>>>>
>> >>>>> If any part of this reasoning or conclusion is mistaken, I would really appreciate being corrected - thank you very much in advance for any feedback or different experiences you can share!
>> >>>>>
>> >>>>> Best Regards.
>> >>>>> Jianbin Chen, github-id: funky-eyes
>> >>>>>
>> >>>>> robert engels <[email protected]> wrote on Fri, Jan 23, 2026 at 20:58:
>> >>>>>> Exactly, this is your problem.
>> >>>>>> The total number of tasks will all be running at once in the thread-per-task model.
>> >>>>>>
>> >>>>>> On Jan 23, 2026, at 6:49 AM, Jianbin Chen <[email protected]> wrote:
>> >>>>>>
>> >>>>>> Hi Robert,
>> >>>>>>
>> >>>>>> Thank you, but I'm a bit confused. In the example above, I only set the core pool size to 200 virtual threads, but for the specific test case we're talking about, the concurrency isn't actually being limited by the pool size at all. Since the maximum thread count is Integer.MAX_VALUE and it's using a SynchronousQueue, tasks are handed off immediately and a new thread gets created to run them right away anyway.
>> >>>>>>
>> >>>>>> Best Regards.
>> >>>>>> Jianbin Chen, github-id: funky-eyes
>> >>>>>>
>> >>>>>> robert engels <[email protected]> wrote on Fri, Jan 23, 2026 at 20:28:
>> >>>>>>> Try using a semaphore to limit the maximum number of tasks in progress at any one time - that is what is causing your memory spike. Think of it this way: since VT threads are so cheap to create, you are essentially creating them all at once - making the working set size equal to the maximum. So you have N * WSS, whereas in the other case you have POOLSIZE * WSS.
>> >>>>>>>
>> >>>>>>> On Jan 23, 2026, at 4:14 AM, Jianbin Chen <[email protected]> wrote:
>> >>>>>>>
>> >>>>>>> Hi Alan,
>> >>>>>>>
>> >>>>>>> Thanks for your reply and for mentioning JEP 444. I've gone through the guidance in JEP 444 and have some understanding of it - which is exactly why I'm feeling a bit puzzled in practice and would really like to hear your thoughts.
>> >>>>>>>
>> >>>>>>> Background - ThreadLocal example (Aerospike)
>> >>>>>>>
>> >>>>>>> ```java
>> >>>>>>> private static final ThreadLocal<byte[]> BufferThreadLocal = new ThreadLocal<byte[]>() {
>> >>>>>>>     @Override
>> >>>>>>>     protected byte[] initialValue() {
>> >>>>>>>         return new byte[DefaultBufferSize];
>> >>>>>>>     }
>> >>>>>>> };
>> >>>>>>> ```
>> >>>>>>>
>> >>>>>>> This Aerospike code allocates a default 8 KB byte[] whenever a new thread is created and stores it in a ThreadLocal for per-thread caching.
>> >>>>>>>
>> >>>>>>> My concern
>> >>>>>>> - With a traditional platform-thread pool, those ThreadLocal byte[] instances are effectively reused because threads are long-lived and pooled.
>> >>>>>>> - If we switch to creating a brand-new virtual thread per task (no pooling), each virtual thread gets its own fresh ThreadLocal byte[], which leads to many short-lived 8 KB allocations.
>> >>>>>>> - That raises allocation rate and GC pressure (despite collectors like ZGC), because ThreadLocal caches aren't reused when threads are ephemeral.
>> >>>>>>>
>> >>>>>>> So my question is: for applications originally designed around platform-thread pools, wouldn't partially pooling virtual threads be beneficial? For example, Tomcat's default max threads is 200 - if I keep a pool of 200 virtual threads, then when load exceeds that core size, a SynchronousQueue will naturally cause new virtual threads to be created on demand. This seems to preserve the behavior that ThreadLocal-based libraries expect, without losing the ability to expand under spikes. Since virtual threads are very lightweight, pooling a reasonable number (e.g., 200) seems to have negligible memory downside while retaining ThreadLocal cache effectiveness.
>> >>>>>>>
>> >>>>>>> Empirical test I ran
>> >>>>>>> (I ran a microbenchmark comparing an unpooled per-task virtual-thread executor and a ThreadPoolExecutor that keeps 200 core virtual threads.)
>> >>>>>>>
>> >>>>>>> ```java
>> >>>>>>> public static void main(String[] args) throws InterruptedException {
>> >>>>>>>     Executor executor = Executors.newThreadPerTaskExecutor(Thread.ofVirtual().name("test-", 1).factory());
>> >>>>>>>     Executor executor2 = new ThreadPoolExecutor(
>> >>>>>>>             200,
>> >>>>>>>             Integer.MAX_VALUE,
>> >>>>>>>             0L,
>> >>>>>>>             java.util.concurrent.TimeUnit.SECONDS,
>> >>>>>>>             new SynchronousQueue<>(),
>> >>>>>>>             Thread.ofVirtual().name("test-threadpool-", 1).factory()
>> >>>>>>>     );
>> >>>>>>>
>> >>>>>>>     // Warm-up
>> >>>>>>>     for (int i = 0; i < 10100; i++) {
>> >>>>>>>         executor.execute(() -> {
>> >>>>>>>             // simulate I/O wait
>> >>>>>>>             try { Thread.sleep(100); } catch (InterruptedException e) { throw new RuntimeException(e); }
>> >>>>>>>         });
>> >>>>>>>         executor2.execute(() -> {
>> >>>>>>>             // simulate I/O wait
>> >>>>>>>             try { Thread.sleep(100); } catch (InterruptedException e) { throw new RuntimeException(e); }
>> >>>>>>>         });
>> >>>>>>>     }
>> >>>>>>>
>> >>>>>>>     // Ensure JIT + warm-up complete
>> >>>>>>>     Thread.sleep(5000);
>> >>>>>>>
>> >>>>>>>     long start = System.currentTimeMillis();
>> >>>>>>>     CountDownLatch countDownLatch = new CountDownLatch(50000);
>> >>>>>>>     for (int i = 0; i < 50000; i++) {
>> >>>>>>>         executor.execute(() -> {
>> >>>>>>>             try { Thread.sleep(100); countDownLatch.countDown(); } catch (InterruptedException e) { throw new RuntimeException(e); }
>> >>>>>>>         });
>> >>>>>>>     }
>> >>>>>>>     countDownLatch.await();
>> >>>>>>>     System.out.println("thread time: " + (System.currentTimeMillis() - start) + " ms");
>> >>>>>>>
>> >>>>>>>     start = System.currentTimeMillis();
>> >>>>>>>     CountDownLatch countDownLatch2 = new CountDownLatch(50000);
>> >>>>>>>     for (int i = 0; i < 50000; i++) {
>> >>>>>>>         executor2.execute(() -> {
>> >>>>>>>             try { Thread.sleep(100); countDownLatch2.countDown(); } catch (InterruptedException e) { throw new RuntimeException(e); }
>> >>>>>>>         });
>> >>>>>>>     }
>> >>>>>>>     countDownLatch2.await();
>> >>>>>>>     System.out.println("thread pool time: " + (System.currentTimeMillis() - start) + " ms");
>> >>>>>>> }
>> >>>>>>> ```
>> >>>>>>>
>> >>>>>>> Result summary
>> >>>>>>> - In my runs, the pooled virtual-thread executor (executor2) performed better than the unpooled per-task virtual-thread executor.
>> >>>>>>> - Even when I increased load by 10x or 100x, the pooled virtual-thread executor still showed better performance.
>> >>>>>>> - In realistic workloads, it seems pooling some virtual threads reduces allocation/GC overhead and improves throughput compared to strictly unpooled virtual threads.
>> >>>>>>>
>> >>>>>>> Final thought / request for feedback
>> >>>>>>> - From my perspective, for systems originally tuned for platform-thread pools, partially pooling virtual threads seems to have no obvious downside and can restore the ThreadLocal cache effectiveness used by many third-party libraries.
>> >>>>>>> - If I've misunderstood the JEP 444 recommendations, virtual-thread semantics, or ThreadLocal behavior, please point out what I'm missing. I'd appreciate your guidance.
>> >>>>>>>
>> >>>>>>> Best Regards.
>> >>>>>>> Jianbin Chen, github-id: funky-eyes
>> >>>>>>>
>> >>>>>>> Alan Bateman <[email protected]> wrote on Fri, Jan 23, 2026 at 17:27:
>> >>>>>>>> On 23/01/2026 07:30, Jianbin Chen wrote:
>> >>>>>>>> > :
>> >>>>>>>> >
>> >>>>>>>> > So my question is:
>> >>>>>>>> >
>> >>>>>>>> > **In scenarios where third-party libraries heavily rely on ThreadLocal for caching / buffering (and we cannot change those libraries to use object pools instead), is explicitly pooling virtual threads (using a ThreadPoolExecutor with a virtual thread factory) considered a recommended / acceptable workaround?**
>> >>>>>>>> >
>> >>>>>>>> > Or are there better / more idiomatic ways to handle this kind of compatibility issue with legacy ThreadLocal-based libraries when migrating to virtual threads?
>> >>>>>>>> >
>> >>>>>>>> > I have already opened a related discussion in the Dubbo project (since Dubbo is one of the libraries affected in our stack):
>> >>>>>>>> >
>> >>>>>>>> > https://github.com/apache/dubbo/issues/16042
>> >>>>>>>> >
>> >>>>>>>> > Would love to hear your thoughts - especially from people who have experience running large-scale virtual-thread-based services with mixed third-party dependencies.
>> >>>>>>>>
>> >>>>>>>> The guidelines we put in JEP 444 [1] are to not pool virtual threads and to avoid caching costly resources in thread locals. Virtual threads support thread locals of course, but that is not useful when some library is looking to share a costly resource between tasks that run on the same thread in a thread pool.
>> >>>>>>>>
>> >>>>>>>> I don't know anything about Aerospike, but working with the maintainers of that library to re-work its buffer management seems like the right course of action here. Your mail says "byte buffers". If this is ByteBuffer, it might be that they are caching direct buffers, as they are expensive to create (and managed by the GC). Maybe they could look at using MemorySegment (it's easy to get a ByteBuffer view of a memory segment) and allocate from an arena that better matches the lifecycle.
>> >>>>>>>>
>> >>>>>>>> Hopefully others will share their experiences with migration, as it is indeed challenging to migrate code developed for thread pools to work efficiently on virtual threads, where there is a 1-1 relationship between the task to execute and the thread.
>> >>>>>>>>
>> >>>>>>>> -Alan
>> >>>>>>>>
>> >>>>>>>> [1] https://openjdk.org/jeps/444#Thread-local-variables
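A minimal sketch of the MemorySegment/arena idea Alan mentions, assuming JDK 22+ where the Foreign Function & Memory API is final (it is a preview feature in JDK 21). The buffer size and values are illustrative only:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;

// Sketch: allocate a buffer from an arena whose lifetime matches the task,
// and hand libraries a ByteBuffer view of it.
public class ArenaBuffer {
    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {      // deterministic lifetime
            MemorySegment segment = arena.allocate(8 * 1024);
            ByteBuffer buf = segment.asByteBuffer();  // zero-copy view for ByteBuffer APIs
            buf.putInt(0, 42);
            System.out.println(buf.getInt(0));
        } // memory is released here, with no GC pressure from per-task buffers
    }
}
```

Because the arena is closed when the task ends, each short-lived virtual thread pays for its buffer deterministically instead of leaving 8 KB byte[] garbage for the collector.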
