andygrove opened a new issue, #4200:
URL: https://github.com/apache/datafusion-comet/issues/4200

   ## Description
   
   Recurring JVM crash on \`macos-14/Spark 4.1, JDK 17, Scala 2.13 [parquet]\` 
(and occasionally other macOS PR-build jobs) after the one 
\`ParquetReadFromFakeHadoopFsSuite\` test completes. Reproduced on at least PR 
#4197 and in earlier runs.
   
   Same failure shape as the closed issue #2354 (\`hdfsThreadDestructor\` on 
Linux amd64), but here on macOS aarch64 the offending frame is anonymous.
   
   ### \`hs_err\` summary
   
   \`\`\`
   SIGBUS (0xa) at pc=0x000000012e828e00
   siginfo: si_signo: 10 (SIGBUS), si_code: 1 (BUS_ADRALN), si_addr: 0x000000012e828e00
   Current thread is native thread
   
   Native frames:
   C  0x000000012e828e00                                      ← unmapped/stripped
   C  [libsystem_pthread.dylib+0x4818]  _pthread_tsd_cleanup+0x1e8
   C  [libsystem_pthread.dylib+0x762c]  _pthread_exit+0x54
   C  [libsystem_pthread.dylib+0x6f48]  _pthread_start+0x94
   
   Registers (selected):
    pc=0x000000012e828e00    x8=0x000000012e828e00    ← callee == pc
   \`\`\`
   
   ### Root cause (suspected)
   
   Classic **\`pthread_key_create\` TSD destructor called on dlclose'd code** 
pattern:
   
   1. libcomet (or a library it pulls in — \`hdfs-opendal\` / libhdfs) calls 
\`pthread_key_create(&key, destructor_fn)\` for cleanup on thread exit.
   2. \`ParquetReadFromFakeHadoopFsSuite\` runs, spawns hdfs worker threads.
   3. The one test finishes (\`931 ms\` in the latest run); hdfs background 
threads finish their work and call \`_pthread_exit\`.
   4. \`_pthread_tsd_cleanup\` walks the TSD key table and jumps to 
\`destructor_fn\`.
   5. By this point the page holding \`destructor_fn\` has been unmapped (or the 
library has been unloaded), so the instruction fetch at \`pc\` faults with 
\`SIGBUS\` / \`BUS_ADRALN\`.
   
   The stack \`_pthread_start → _pthread_exit → _pthread_tsd_cleanup → 
<stale>\` plus \`pc == x8\` (the TSD cleanup loop stores the destructor in 
\`x8\` before \`blr x8\` on arm64) is the tell.
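
The mechanism in steps 1 to 4 can be sketched in a few lines of C. This is only an illustration under assumptions: `demo_key`, `demo_destructor`, `worker`, and `run_worker_once` are hypothetical names, not code from libcomet or libhdfs, and the destructor here stays mapped, so nothing crashes. It shows that the destructor registered with `pthread_key_create` runs on the exiting thread without any explicit call from user code, which is why the faulting frame sits directly under `_pthread_tsd_cleanup`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* All names here are hypothetical; none come from libcomet or libhdfs. */
static pthread_key_t demo_key;
static atomic_int destructor_calls;

/* Invoked on the exiting thread by the pthread TSD cleanup path. */
static void demo_destructor(void *value) {
    (void)value;
    atomic_fetch_add(&destructor_calls, 1);
}

static void *worker(void *arg) {
    /* Storing a non-NULL value arms the destructor for this thread. */
    pthread_setspecific(demo_key, arg);
    return NULL; /* returning from the start routine triggers thread exit */
}

/* Spawns one worker, joins it, and reports how often the destructor ran. */
int run_worker_once(void) {
    static int slot = 1;
    pthread_t thread;
    int rc;

    atomic_store(&destructor_calls, 0);
    rc = pthread_key_create(&demo_key, demo_destructor);
    assert(rc == 0);
    rc = pthread_create(&thread, NULL, worker, &slot);
    assert(rc == 0);
    rc = pthread_join(thread, NULL);
    assert(rc == 0);
    (void)rc;
    pthread_key_delete(demo_key);
    return atomic_load(&destructor_calls);
}
```

If `demo_destructor` lived in a shared library that was unloaded between `pthread_setspecific` and the thread's exit, the key table would still hold its old address, and the cleanup loop would branch to a page that is no longer mapped.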
   
   ### Where the stale destructor comes from
   
   The suite depends on the \`hdfs-opendal\` feature 
(\`assume(isFeatureEnabled("hdfs-opendal"))\`). On macOS aarch64 CI that 
feature is enabled, so every run exercises the JNI bridge to Hadoop native 
libs. Those libs are the most likely registrars of the TSD key (cf. the 
original #2354 crash that pointed at \`hdfsThreadDestructor+0x61\`).
   
   ### Mitigations to consider
   
   - Skip \`ParquetReadFromFakeHadoopFsSuite\` on macOS aarch64 until the root 
cause is fixed.
   - Unregister TSD keys (\`pthread_key_delete\`) at library-unload time, or 
avoid dlclose-like paths while TSD destructors are registered.
   - Upstream fix in whichever hdfs binding registers the key (mirroring the 
\`hdfsThreadDestructor\` case in #2354).
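
A minimal sketch of the second mitigation, assuming the library that owns the key can run code when it is unloaded. The names (`lib_key`, `lib_register_key`, `lib_unregister_key`, `lib_on_unload`) are hypothetical, and `__attribute__((destructor))` is a GCC/Clang extension rather than part of POSIX. Once `pthread_key_delete` has returned, the destructor associated with that key is no longer called at thread exit. The flag is not thread-safe, and the sketch does not protect a thread that is already inside the destructor at unload time.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical library-private state; not the actual libhdfs layout. */
static pthread_key_t lib_key;
static bool lib_key_live = false;

static void lib_thread_destructor(void *value) {
    (void)value; /* per-thread cleanup would go here */
}

/* Call once from library initialisation. Returns 0 on success. */
int lib_register_key(void) {
    int rc;
    if (lib_key_live) return 0;
    rc = pthread_key_create(&lib_key, lib_thread_destructor);
    if (rc == 0) lib_key_live = true;
    return rc;
}

/* Drops the key so exiting threads stop calling into this image.
   Safe to call more than once. */
int lib_unregister_key(void) {
    if (!lib_key_live) return 0;
    lib_key_live = false;
    return pthread_key_delete(lib_key);
}

/* GCC/Clang extension: runs at dlclose() of the image or at process exit. */
__attribute__((destructor))
static void lib_on_unload(void) {
    (void)lib_unregister_key();
}
```

Whether this is applicable depends on which library actually registers the key, which the issue has not yet established.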
   
   Linking PR #4197 where this most recently surfaced.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]