eddieran opened a new issue, #16194:
URL: https://github.com/apache/dubbo/issues/16194

   ### Pre-check
   
   - [X] I am sure that all the content I provide is in English.
   - [X] I had searched in the 
[issues](https://github.com/apache/dubbo/issues?q=is%3Aissue) and found no 
similar issues.
   
   ### Apache Dubbo Component
   
   Java SDK (apache/dubbo)
   
   ### Dubbo Version
   
   Dubbo Java 3.3.6, OpenJDK 21.0.5+11-LTS, Ubuntu (ZGC Generational: 
`-XX:+UseZGC -XX:+ZGenerational`, Heap: `-Xms65g -Xmx65g`)
   
   ### Steps to reproduce this issue
   
   When the Dubbo thread pool is exhausted, 
`AbortPolicyWithReport.rejectedExecution()` invokes `dumpJStack()` → 
`JVMUtil.jstack()` → `ThreadMXBean.dumpAllThreads(true, true)`.
   
   The second `true` parameter (`lockedSynchronizers`) forces the JVM to scan 
the **entire Java heap** at a safepoint to find all 
`AbstractOwnableSynchronizer` instances via 
`HeapInspection::find_instances_at_safepoint()`. On ZGC with large heaps, every 
object reference during this scan passes through ZGC's load barrier (color bit 
check → potential relocate → forwarding table lookup → remap), making this 
operation **12–20x slower** than on G1/Parallel.
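
   The relative cost is easy to observe with a minimal timing sketch (a hypothetical standalone class, not part of Dubbo; the absolute numbers depend on GC, heap size, and live set, and on a small developer heap the gap is modest):

   ```java
   import java.lang.management.ManagementFactory;
   import java.lang.management.ThreadMXBean;

   public class DumpCost {
       // Dumps all threads, prints the elapsed time, returns the thread count.
       static int timedDump(boolean lockedSynchronizers) {
           ThreadMXBean mx = ManagementFactory.getThreadMXBean();
           long t0 = System.nanoTime();
           int n = mx.dumpAllThreads(true, lockedSynchronizers).length;
           long micros = (System.nanoTime() - t0) / 1_000;
           System.out.printf("lockedSynchronizers=%b: %d threads in %d us%n",
                   lockedSynchronizers, n, micros);
           return n;
       }

       public static void main(String[] args) {
           // Cheap: monitor info is derived from the stacks already being walked.
           timedDump(false);
           // Potentially expensive: adds a full-heap scan for
           // AbstractOwnableSynchronizer instances inside the safepoint.
           timedDump(true);
       }
   }
   ```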
   
   #### Environment
   - **JDK**: OpenJDK 21.0.5+11-LTS
   - **GC**: ZGC Generational (`-XX:+UseZGC -XX:+ZGenerational`)
   - **Heap**: 65GB (`-Xms65g -Xmx65g`)
   - **Threads**: ~1950
   - **Dubbo version**: 3.3.6
   
   #### Reproduction Steps
   
   1. Deploy a Dubbo application with ZGC and a large heap (≥32GB)
   2. Drive enough traffic to exhaust the Dubbo thread pool
   3. `AbortPolicyWithReport.rejectedExecution()` fires and calls `dumpJStack()`
   4. Observe a **~37-second full application freeze** in GC logs (safepoint 
duration)
   
   #### Relevant Code
   
   [`JVMUtil.java` Line 
36](https://github.com/apache/dubbo/blob/3.3/dubbo-common/src/main/java/org/apache/dubbo/common/utils/JVMUtil.java#L36):
   
   ```java
   public static void jstack(OutputStream stream) throws Exception {
       ThreadMXBean threadMxBean = ManagementFactory.getThreadMXBean();
       for (ThreadInfo threadInfo : threadMxBean.dumpAllThreads(true, true)) {
           // the second argument (lockedSynchronizers=true) triggers the heap scan
           stream.write(getThreadDumpString(threadInfo).getBytes(StandardCharsets.UTF_8));
       }
   }
   ```
   
   ### What you expected to happen
   
   `JVMUtil.jstack()` should complete in **milliseconds to low seconds** 
without causing a significant safepoint pause, regardless of GC algorithm or 
heap size.
   
   **What actually happens:** A **36.83-second safepoint pause** freezes all 
~1950 application threads. The freeze cascades: queued requests during the 
pause immediately exhaust the thread pool again on release, triggering another 
dump. We observed **4 consecutive ThreadDumps producing ~150 seconds of 
near-total service unavailability**.
   
   #### GC Log Evidence (safepoint entries)
   
   | Event | Timestamp | At Safepoint Duration | Thread Count |
   |-------|-----------|----------------------|--------------|
   | ThreadDump #1 | 10:41:04.466 | 38,589,643,169 ns (**38.59s**) | 1954 |
   | ThreadDump #2 | 10:51:15.645 | 36,833,709,974 ns (**36.83s**) | 1947 |
   
   For comparison, normal ZGC safepoint operations (Mark Start, Mark End, 
Relocate Start) complete in **0.1–0.8ms**.
   
   #### async-profiler Wall-Mode Stack Trace (VM Thread during freeze)
   
   ```
   VM_ThreadDump::doit()
     → ConcurrentLocksDump::dump_at_safepoint()
       → HeapInspection::find_instances_at_safepoint()
         → ZHeap::object_iterate()
           → ZHeapIterator::drain_and_steal<true>()
             → FindInstanceClosure::do_object()
               → ZBarrier::load_barrier_on_oop_field_preloaded
   ```
   
   ZGC barrier frames account for **55%** of VM Thread samples during the 
freeze:
   
   | Frame | % of VM Thread samples |
   |-------|----------------------|
   | `FindInstanceClosure::do_object()` | 25.0% |
   | `ZBarrier::load_barrier_on_oop_field_preloaded` | 19.8% |
   | `OopOopIterateDispatch::oop_oop_iterate` | 17.0% |
   | `ZBarrierSet::AccessBarrier::oop_access_barrier` | 8.4% |
   
   ### Anything else
   
   #### Root Cause Analysis
   
   `dumpAllThreads(lockedMonitors, lockedSynchronizers)` with 
`lockedSynchronizers=true` requires the JVM to find all heap instances of 
`AbstractOwnableSynchronizer`. This is implemented via 
`HeapInspection::find_instances_at_safepoint()`, which iterates the **entire 
heap** at a safepoint.
   
   - On traditional GCs (G1, Parallel): heap iteration uses direct pointer 
access — a 65GB heap scan takes ~1–3 seconds.
   - On ZGC: every object reference passes through the load barrier (color bit 
check, potential relocate, forwarding table lookup), inflating the cost by 
**12–20x** to ~37 seconds.
   
   This is a known JVM-level characteristic. The OpenJDK community has 
addressed it on the tooling side:
   - [JDK-8262098](https://bugs.openjdk.org/browse/JDK-8262098): "jhsdb jstack 
can be very slow" — identified heap scan for `AbstractOwnableSynchronizer` as 
expensive
   - [JDK-8324066](https://bugs.openjdk.org/browse/JDK-8324066): "clhsdb jstack 
should not by default scan for j.u.c locks" — fixed by disabling lock scanning 
by default
   
   However, the **programmatic API** (`ThreadMXBean.dumpAllThreads`) has no 
such protection — callers passing `lockedSynchronizers=true` still trigger the 
full heap scan.
   
   #### Suggested Fix
   
   **Option A (minimal, recommended):** Change `dumpAllThreads(true, true)` to 
`dumpAllThreads(true, false)`:
   
   ```java
   // Before
   ThreadInfo[] infos = threadMxBean.dumpAllThreads(true, true);
   
   // After — avoids heap scan for AbstractOwnableSynchronizer
   ThreadInfo[] infos = threadMxBean.dumpAllThreads(true, false);
   ```
   
   This retains locked monitor information (cheap, since it is derived from the thread stacks already being walked for the dump) but skips `lockedSynchronizers` (which triggers the expensive heap scan). Lock contention on `synchronized` blocks remains fully visible; only `java.util.concurrent.locks` ownership is lost from the dump output.
   
   **Option B (adaptive):** Detect the GC type and heap size at runtime; only 
pass `lockedSynchronizers=true` when the cost is acceptable (e.g., G1/Parallel 
with heap < 16GB).
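
   Such a guard might look like the following sketch (a hypothetical helper, not existing Dubbo code; the collector bean names such as "ZGC Cycles" are HotSpot implementation details, so the string matching and the 16 GiB threshold are assumptions):

   ```java
   import java.lang.management.GarbageCollectorMXBean;
   import java.lang.management.ManagementFactory;

   public class SynchronizerDumpPolicy {
       private static final long HEAP_THRESHOLD = 16L * 1024 * 1024 * 1024; // 16 GiB

       // Heuristic: only request lockedSynchronizers when the heap is small
       // and the collector is not a barrier-based one (ZGC/Shenandoah).
       static boolean shouldDumpSynchronizers() {
           if (Runtime.getRuntime().maxMemory() >= HEAP_THRESHOLD) {
               return false;
           }
           for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
               String name = gc.getName(); // e.g. "G1 Young Generation", "ZGC Cycles"
               if (name.contains("ZGC") || name.contains("Shenandoah")) {
                   return false;
               }
           }
           return true;
       }
   }
   ```

   The call site would then become `dumpAllThreads(true, SynchronizerDumpPolicy.shouldDumpSynchronizers())`.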
   
   **Option C (JDK 21+):** Use the management-API equivalent of `jcmd <pid> Thread.dump_to_file`, namely `com.sun.management.HotSpotDiagnosticMXBean.dumpThreads()`, which produces a thread dump without requiring a safepoint heap scan.
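
   A sketch of that approach (a hypothetical wrapper class; `dumpThreads()` exists only on JDK 21+ and requires an absolute path to a file that does not yet exist):

   ```java
   import com.sun.management.HotSpotDiagnosticMXBean;
   import java.lang.management.ManagementFactory;
   import java.nio.file.Files;
   import java.nio.file.Path;

   public class SafeThreadDump {
       // Writes a plain-text thread dump without the full-heap
       // AbstractOwnableSynchronizer scan that dumpAllThreads(true, true)
       // performs. Returns the path of the created file.
       static Path dumpToFile() throws Exception {
           HotSpotDiagnosticMXBean diag =
                   ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
           // dumpThreads refuses to overwrite, so pick a fresh absolute path.
           Path out = Files.createTempDirectory("dubbo-dump")
                   .resolve("threads.txt").toAbsolutePath();
           diag.dumpThreads(out.toString(),
                   HotSpotDiagnosticMXBean.ThreadDumpFormat.TEXT_PLAIN);
           return out;
       }
   }
   ```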
   
   #### Cascade Effect
   
   The existing rate-limiting fix (#14468) prevents repeated dumps within 10 
minutes but does **not** reduce the per-dump cost. When a single dump takes 37 
seconds, the cascade is:
   1. Thread pool exhausted → `dumpAllThreads(true, true)` → 37s freeze
   2. During freeze, all incoming requests queue up
   3. On safepoint release, the queued requests immediately exhaust the pool 
again
   4. If the 10-minute window has passed (or on first occurrence), another 37s 
freeze
   
   In our production incident, this produced ~150 seconds of near-total 
unavailability.
   
   #### Related Issues
   
   - #6696 — `AbortPolicyWithReport` causes JVM pause (same symptom, reported 
in 2020 for Dubbo 2.6.8, no resolution)
   - #14467 / #14468 — Fix repeated jstack when thread pool exhausted 
(addresses frequency, not per-call cost)
   - [JDK-8262098](https://bugs.openjdk.org/browse/JDK-8262098) — jhsdb jstack 
can be very slow
   - [JDK-8324066](https://bugs.openjdk.org/browse/JDK-8324066) — clhsdb jstack 
should not scan for j.u.c locks by default
   - [JDK-8221153](https://bugs.openjdk.org/browse/JDK-8221153) — ZGC heap 
iteration and verification pollutes GC statistics
   
   ### Are you willing to submit a pull request to fix on your own?
   
   - [X] Yes I am willing to submit a pull request on my own!

