eddieran opened a new pull request, #16195:
URL: https://github.com/apache/dubbo/pull/16195

   ## What is the purpose of the change
   
   Fixes #16194
   
   `JVMUtil.jstack()` calls `ThreadMXBean.dumpAllThreads(true, true)`. The 
`lockedSynchronizers=true` parameter forces the JVM to scan the **entire Java 
heap** at a safepoint to find all `AbstractOwnableSynchronizer` instances. On 
ZGC with large heaps, this causes catastrophic safepoint pauses (**36–39 
seconds** measured on a 65GB heap with ~1950 threads) that freeze the entire 
application.
   
   This PR changes `lockedSynchronizers` from `true` to `false`, eliminating 
the heap scan.
   
   ### Root Cause
   
   | GC | Heap scan mechanism | Scan time (65 GB heap) |
   |----|---------------------|------------------------|
   | G1 / Parallel | Direct pointer access | ~1–3s |
   | **ZGC** | Load barrier on every reference (color bit check → relocate → forwarding table → remap) | **~37s** |
   
   The OpenJDK community already fixed this on the tooling side 
([JDK-8324066](https://bugs.openjdk.org/browse/JDK-8324066): "clhsdb jstack 
should not scan for j.u.c locks by default"), but the programmatic API 
(`ThreadMXBean.dumpAllThreads`) has no such protection.
   
   ### Production Impact
   
   When `AbortPolicyWithReport` fires on a ZGC application with a large heap:
   1. `dumpAllThreads(true, true)` → 37s full application freeze
   2. Once the pause ends, queued requests immediately exhaust the pool again → cascading freezes
   3. Observed: **4 consecutive dumps → ~150s of near-total service unavailability**
   
   ## Brief changelog
   
   - `JVMUtil.jstack()`: Change `dumpAllThreads(true, true)` to 
`dumpAllThreads(true, false)`
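   A minimal sketch of a `jstack()`-style helper built on the changed call. The class name `ThreadDumpSketch` and its output format are hypothetical (loosely modeled on `jstack` output); this is not Dubbo's actual `JVMUtil` code. It prints every frame and places `- locked` lines at the frame that acquired each monitor, since `ThreadInfo.toString()` truncates long stacks.

```java
import java.lang.management.LockInfo;
import java.lang.management.ManagementFactory;
import java.lang.management.MonitorInfo;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/** Hypothetical jstack()-style helper; illustrative only, not Dubbo's JVMUtil. */
public final class ThreadDumpSketch {

    private ThreadDumpSketch() {}

    /** Dumps all threads with monitor info but without the j.u.c. synchronizer heap scan. */
    public static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // lockedMonitors: keep "- locked" lines for synchronized blocks when the VM supports it.
        // lockedSynchronizers = false: skip the safepoint heap scan for AbstractOwnableSynchronizer.
        ThreadInfo[] infos = mx.dumpAllThreads(mx.isObjectMonitorUsageSupported(), false);
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            if (info != null) { // defensive: skip entries that could not be sampled
                appendThread(sb, info);
            }
        }
        return sb.toString();
    }

    private static void appendThread(StringBuilder sb, ThreadInfo info) {
        sb.append('"').append(info.getThreadName()).append("\" Id=")
          .append(info.getThreadId()).append(' ').append(info.getThreadState());
        LockInfo waitingOn = info.getLockInfo(); // monitor or parked-on object, if any
        if (waitingOn != null) {
            sb.append(" on ").append(waitingOn);
        }
        if (info.getLockOwnerName() != null) {
            sb.append(" owned by \"").append(info.getLockOwnerName())
              .append("\" Id=").append(info.getLockOwnerId());
        }
        sb.append('\n');

        StackTraceElement[] frames = info.getStackTrace();
        MonitorInfo[] monitors = info.getLockedMonitors();
        // Print every frame; ThreadInfo.toString() would truncate long stacks.
        for (int depth = 0; depth < frames.length; depth++) {
            sb.append("\tat ").append(frames[depth]).append('\n');
            for (MonitorInfo m : monitors) {
                if (m.getLockedStackDepth() == depth) {
                    sb.append("\t- locked ").append(m).append('\n');
                }
            }
        }
        sb.append('\n');
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

   Passing `mx.isObjectMonitorUsageSupported()` instead of a literal `true` is a small defensive choice: `dumpAllThreads` throws `UnsupportedOperationException` when `lockedMonitors` is requested on a VM without monitor-usage support.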
   
   ## What is lost
   
   Only the "Locked synchronizers" section at the end of each thread's entry is lost, i.e. ownership of `java.util.concurrent.locks` synchronizers such as `ReentrantLock` and the write lock of `ReentrantReadWriteLock`. All other diagnostic information is retained:
   
   | Information | Retained? |
   |---|---|
   | Thread name, ID, state | Yes |
   | Full stack traces | Yes |
   | `synchronized` block contention (BLOCKED on ...) | Yes |
   | `synchronized` monitor ownership (- locked ...) | Yes |
   | Waiting/parking state | Yes |
   | Lock owner for BLOCKED threads | Yes |
   | **j.u.c.locks ownership (ReentrantLock, etc.)** | **No** |
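   The table can be checked with a small experiment, assuming a HotSpot JVM where both monitor and synchronizer usage are supported. A helper thread holds a `ReentrantLock` and a `synchronized` monitor, and the program dumps it both ways; the class name `LockedSynchronizersDemo` and the thread name are illustrative. With `lockedSynchronizers=true` the `ReentrantLock` appears in `getLockedSynchronizers()`; with `false` that array is empty, while the monitor is still reported by `getLockedMonitors()`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical demo: what dumpAllThreads reports with and without the synchronizer scan. */
public final class LockedSynchronizersDemo {

    /** Lock counts reported for the helper thread in one dump. */
    public static final class Report {
        public final int lockedMonitors;
        public final int lockedSynchronizers;

        private Report(int lockedMonitors, int lockedSynchronizers) {
            this.lockedMonitors = lockedMonitors;
            this.lockedSynchronizers = lockedSynchronizers;
        }
    }

    private static final String HOLDER_NAME = "lock-holder";

    /** Returns {dumpAllThreads(true, true) report, dumpAllThreads(true, false) report}. */
    public static Report[] capture() throws InterruptedException {
        ReentrantLock jucLock = new ReentrantLock();
        Object monitor = new Object();
        CountDownLatch holding = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            jucLock.lock(); // j.u.c. ownership: reported only when lockedSynchronizers=true
            try {
                synchronized (monitor) { // monitor ownership: reported when lockedMonitors=true
                    holding.countDown();
                    release.await();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                jucLock.unlock();
            }
        }, HOLDER_NAME);
        holder.start();
        holding.await(); // helper now owns both locks

        try {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            Report withScan = find(mx.dumpAllThreads(true, true));
            Report withoutScan = find(mx.dumpAllThreads(true, false));
            return new Report[] {withScan, withoutScan};
        } finally {
            release.countDown();
            holder.join();
        }
    }

    private static Report find(ThreadInfo[] infos) {
        for (ThreadInfo info : infos) {
            if (info != null && HOLDER_NAME.equals(info.getThreadName())) {
                return new Report(info.getLockedMonitors().length,
                        info.getLockedSynchronizers().length);
            }
        }
        throw new IllegalStateException("thread not found: " + HOLDER_NAME);
    }

    public static void main(String[] args) throws InterruptedException {
        Report[] r = capture();
        System.out.println("dumpAllThreads(true, true):  monitors=" + r[0].lockedMonitors
                + ", synchronizers=" + r[0].lockedSynchronizers);
        System.out.println("dumpAllThreads(true, false): monitors=" + r[1].lockedMonitors
                + ", synchronizers=" + r[1].lockedSynchronizers);
    }
}
```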
   
   ## Verifying this change
   
   Existing tests pass — all tests in `AbortPolicyWithReportTest` mock the 
`jstack()` method and are not affected by the parameter change.
   
   The fix can be verified by:
   1. Deploy a Dubbo app with ZGC + large heap (≥32GB)
   2. Exhaust the thread pool to trigger `AbortPolicyWithReport`
   3. Observe the safepoint duration in the JVM's safepoint logs (e.g. `-Xlog:safepoint`); it should drop from ~37s to <100ms
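   As a rough in-process complement to step 3, the caller-side latency of both variants can be timed. The class name `DumpTimingSketch` is hypothetical, and the sketch does not populate the heap, so on a small heap both calls finish quickly; the gap described above should only appear on a large, populated ZGC heap (e.g. run with `-XX:+UseZGC -Xlog:safepoint` and compare against the safepoint log).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Hypothetical timing harness for the two dumpAllThreads variants. */
public final class DumpTimingSketch {

    private DumpTimingSketch() {}

    /** Wall-clock milliseconds for one dumpAllThreads(lockedMonitors, lockedSynchronizers) call. */
    public static double timeDumpMillis(boolean lockedMonitors, boolean lockedSynchronizers) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long start = System.nanoTime();
        mx.dumpAllThreads(lockedMonitors, lockedSynchronizers); // includes the safepoint pause
        return (System.nanoTime() - start) / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Warm up once so class loading is not attributed to the first measurement.
        timeDumpMillis(true, false);
        System.out.printf("dumpAllThreads(true, true):  %.1f ms%n", timeDumpMillis(true, true));
        System.out.printf("dumpAllThreads(true, false): %.1f ms%n", timeDumpMillis(true, false));
    }
}
```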

