balodesecurity opened a new pull request, #8290:
URL: https://github.com/apache/hadoop/pull/8290

   ## Problem
   
   When an application runs on a DataNode with short-circuit reads enabled and 
a custom `URLClassLoader` (whose classpath contains remote HDFS JARs) set as 
the thread context ClassLoader, the main thread can hang indefinitely.
   
   **Deadlock chain:**
   
   1. Thread T enters `DfsClientShmManager.EndpointShmManager.allocSlot()`, 
sets `loading = true`, releases the lock, and calls `requestNewShm()`
   2. `requestNewShm()` creates a `DfsClientShm`, whose constructor 
(`ShortCircuitShm`) calls `POSIX.mmap()`. If `NativeIO.POSIX` has not been 
initialized yet, this call triggers its class static initializer
   3. The static initializer calls `new Configuration()`, which loads XML 
resources via the thread's **context ClassLoader**
   4. If the context ClassLoader is a `URLClassLoader` backed by remote HDFS 
JARs, resolving those JARs triggers an HDFS read
   5. That read re-enters `allocSlot()` on **the same thread T**, which 
acquires the lock (since it was released), sees `loading == true`, and calls 
`finishedLoading.awaitUninterruptibly()`
   6. Thread T is now parked waiting for a condition that **it itself** must 
signal, so it **hangs indefinitely**
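
The chain above can be reproduced in miniature. The following is a toy sketch, not the actual `DfsClientShmManager` code: the class `ReentrantWaitDemo`, its method bodies, and the 200 ms bounded wait are assumptions made for illustration (the real code waits with `awaitUninterruptibly()`, which has no timeout). The nested call stands in for the HDFS read triggered while the first call is still loading.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Toy model of the re-entrant wait; hypothetical, not the Hadoop class. */
class ReentrantWaitDemo {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition finishedLoading = lock.newCondition();
  private boolean loading = false;

  /**
   * Returns true if this call performed the load, false if it gave up
   * waiting for another load to finish. The bounded wait exists only so
   * the demo terminates; an unbounded wait would park forever.
   */
  boolean allocSlot(Runnable loader) throws InterruptedException {
    lock.lock();
    try {
      long nanos = TimeUnit.MILLISECONDS.toNanos(200);
      while (loading) {
        if (nanos <= 0L) {
          return false; // timed out: nobody signalled finishedLoading
        }
        nanos = finishedLoading.awaitNanos(nanos);
      }
      loading = true;
    } finally {
      lock.unlock();
    }
    try {
      // Stand-in for requestNewShm(): runs with the lock released.
      loader.run();
    } finally {
      lock.lock();
      try {
        loading = false;
        finishedLoading.signalAll();
      } finally {
        lock.unlock();
      }
    }
    return true;
  }

  /** Runs a nested allocSlot() on the same thread; true if it timed out. */
  static boolean nestedCallTimedOut() throws InterruptedException {
    ReentrantWaitDemo demo = new ReentrantWaitDemo();
    boolean[] nestedResult = {true};
    demo.allocSlot(() -> {
      try {
        nestedResult[0] = demo.allocSlot(() -> { });
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    return !nestedResult[0];
  }
}
```

The nested call can only be woken by the outer call's `signalAll()`, and the outer call cannot reach that line until the nested call returns, which is why the wait never completes on its own.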
   
   ## Fix
   
   Track which thread set `loading = true` via a new `loadingThread` field in 
`EndpointShmManager`. When `allocSlot()` detects that `loading == true` and the 
current thread **is** the loading thread, it returns `null` immediately instead 
of waiting. The caller then falls back transparently to a normal 
(non-short-circuit) read.
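
A minimal sketch of this guard, assuming a simplified manager: the class `LoadingThreadGuardSketch`, the `String` return value, and the `Runnable` loader are hypothetical stand-ins, and only the `loading` / `loadingThread` / `finishedLoading` names come from the description above. It is meant to illustrate the idea, not to mirror the patch line by line.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical, simplified model of the loadingThread guard. */
class LoadingThreadGuardSketch {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition finishedLoading = lock.newCondition();
  private boolean loading = false;
  private Thread loadingThread = null;

  /** Returns a slot label, or null when the caller should fall back. */
  String allocSlot(Runnable loader) {
    lock.lock();
    try {
      while (loading) {
        if (loadingThread == Thread.currentThread()) {
          // Re-entrant call from the thread that is doing the load:
          // waiting here could never be satisfied, so give up instead.
          return null;
        }
        finishedLoading.awaitUninterruptibly();
      }
      loading = true;
      loadingThread = Thread.currentThread();
    } finally {
      lock.unlock();
    }
    try {
      // Stand-in for requestNewShm(): runs with the lock released.
      loader.run();
    } finally {
      lock.lock();
      try {
        // Clear both fields together, then wake any other waiters.
        loading = false;
        loadingThread = null;
        finishedLoading.signalAll();
      } finally {
        lock.unlock();
      }
    }
    return "slot";
  }

  /** Returns {outer result, nested result} for a same-thread nested call. */
  static String[] demo() {
    LoadingThreadGuardSketch sketch = new LoadingThreadGuardSketch();
    String[] results = new String[2];
    results[0] = sketch.allocSlot(() -> {
      results[1] = sketch.allocSlot(() -> { });
    });
    return results;
  }
}
```

In this sketch the outer call still completes its load, while the nested call on the same thread gets `null`. Other threads keep the original behaviour and wait on `finishedLoading`.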
   
   Changes:
   - `DfsClientShmManager.java`: add `loadingThread` field; set/clear it 
alongside `loading`; detect re-entrant calls and return `null` early
   - `TestDfsClientShmManager.java`: regression test that injects the 
re-entrant state via reflection and verifies `null` is returned within a 
10-second timeout (without this fix, the call hangs indefinitely)
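
The reflection-based approach can be sketched as follows. This is an illustration on a hypothetical stand-in class (`ToyShmManager`), not the test added in this PR: the helper names, the executor-based timeout, and the stand-in's `allocSlot()` body are all assumptions. Only the idea of setting `loading` and `loadingThread` by reflection and bounding the call at 10 seconds comes from the description above.

```java
import java.lang.reflect.Field;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class ReentrantStateInjectionSketch {

  /** Hypothetical stand-in, reduced to the guarded wait. */
  static class ToyShmManager {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition finishedLoading = lock.newCondition();
    private boolean loading;
    private Thread loadingThread;

    Object allocSlot() {
      lock.lock();
      try {
        while (loading) {
          if (loadingThread == Thread.currentThread()) {
            return null; // re-entrant call: do not wait
          }
          finishedLoading.awaitUninterruptibly();
        }
        return new Object();
      } finally {
        lock.unlock();
      }
    }
  }

  /** Sets a private field by name, bypassing access checks. */
  private static void setField(Object target, String name, Object value)
      throws ReflectiveOperationException {
    Field field = target.getClass().getDeclaredField(name);
    field.setAccessible(true);
    field.set(target, value);
  }

  /** Returns true if allocSlot() returned null within the timeout. */
  static boolean returnsNullWhenReentrant() throws Exception {
    ToyShmManager manager = new ToyShmManager();
    ExecutorService executor = Executors.newSingleThreadExecutor();
    try {
      Callable<Boolean> body = () -> {
        // Inject the state a re-entrant call would observe: a load is in
        // progress and the current thread is the one performing it.
        setField(manager, "loading", true);
        setField(manager, "loadingThread", Thread.currentThread());
        return manager.allocSlot() == null;
      };
      // A hang surfaces as a TimeoutException instead of blocking forever.
      return executor.submit(body).get(10, TimeUnit.SECONDS);
    } finally {
      executor.shutdownNow();
    }
  }
}
```

Injecting the state directly avoids having to reproduce the ClassLoader setup that triggers the nested read in production.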
   
   ## Test
   
   ```
   mvn test -pl hadoop-hdfs-project/hadoop-hdfs-client \
     -Dtest=TestDfsClientShmManager
   ```
   
   The test requires native Unix domain socket support (`libhadoop`) and 
auto-skips without it (matching the pattern used throughout the shortcircuit 
test suite).
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

