jpatra72 opened a new issue, #50188:
URL: https://github.com/apache/arrow/issues/50188

### Describe the bug, including details regarding any error messages, version, and platform.

In `pyarrow==24.0.0`, a process that creates a `pyarrow.fs.S3FileSystem` and keeps a reference to it until interpreter shutdown deadlocks during `Py_FinalizeEx`. The hang is in the `ensure_s3_finalized` atexit handler registered by `pyarrow/fs.py`, inside `Aws::Crt::Io::ClientBootstrap::~ClientBootstrap()`. The same code exits cleanly on `pyarrow==23.0.0`.

This is the AWS-CRT "blocking-shutdown" deadlock that aws-sdk-cpp documents in [aws/aws-sdk-cpp#2769](https://github.com/aws/aws-sdk-cpp/issues/2769), but it is reachable from pure-Python pyarrow without any explicit S3 I/O, just from holding an `S3FileSystem` Python reference past interpreter shutdown. Arrow's atexit-driven finalize ordering plus the AWS C/C++ stack version bumps in 24.0.0 turn this from a quiet teardown into an indefinite hang.

#### Reproducer

```python
# bug.py
import pyarrow.fs

s3 = pyarrow.fs.S3FileSystem()
print("S3FileSystem created, exiting...")
```

```
$ python bug.py
S3FileSystem created, exiting...
# process never exits; SIGTERM required
```

No S3 request is issued: constructing `S3FileSystem()` and holding the reference is enough.

#### Diagnostic: `del s3` fixes it; `pyarrow.fs.finalize_s3()` does **not**

```python
# fixed.py: clean exit
import pyarrow.fs

s3 = pyarrow.fs.S3FileSystem()
print("S3FileSystem created, exiting...")
del s3
```

```python
# still_hangs.py: same deadlock, just earlier in the script
import pyarrow.fs

s3 = pyarrow.fs.S3FileSystem()
pyarrow.fs.finalize_s3()  # hangs here instead of at atexit
```

That asymmetry pins the cause to "live `S3Client` reference at the moment `Aws::ShutdownAPI` runs," not to a missed wakeup or a broken finalizer.

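To turn the hang into a test failure rather than a stuck job, the reproducer can be run in a child interpreter under a timeout. The following is a minimal sketch using only the standard library; the helper name `exits_within` and the `REPRO` constant are hypothetical and not part of pyarrow.

```python
import subprocess
import sys


def exits_within(source: str, timeout_s: float) -> bool:
    """Run `source` in a fresh interpreter; return True if it exits in time.

    Hypothetical helper for illustration, not a pyarrow API. Any exit
    (including a non-zero status) counts as "did not hang".
    """
    try:
        subprocess.run(
            [sys.executable, "-c", source],
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before re-raising,
        # so a hung interpreter does not outlive this check.
        return False
    return True


# Same statements as bug.py above, minus the print.
REPRO = "import pyarrow.fs\ns3 = pyarrow.fs.S3FileSystem()\n"
```

Based on the behaviour described above, `exits_within(REPRO, 30.0)` would be expected to return `False` on an affected install; the 30 s budget is an arbitrary choice. Because any exit counts as success, a missing pyarrow install also returns `True`, so the check is only meaningful where the import works.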
#### gdb backtrace: main thread (the deadlock victim)

```
Thread 1 (Thread 0x7ff7151f5740 (LWP 3685) "python"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007ff70fe02671 in std::__atomic_futex_unsigned_base::_M_futex_wait_until (...) from /lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007ff7114d8f01 in Aws::Crt::Io::ClientBootstrap::~ClientBootstrap() from pyarrow/libarrow.so.2400
#3  0x00007ff711475b91 in Aws::SetDefaultClientBootstrap(...) from pyarrow/libarrow.so.2400
#4  0x00007ff711475bc9 in Aws::CleanupCrt() from pyarrow/libarrow.so.2400
#5  0x00007ff7114735a5 in Aws::ShutdownAPI(...) from pyarrow/libarrow.so.2400
#6  0x00007ff710a04a28 in arrow::fs::EnsureS3Finalized() from pyarrow/libarrow.so.2400
#7  0x00007ff6fe6693c0 in __pyx_pw_7pyarrow_5_s3fs_7ensure_s3_finalized(...) from pyarrow/_s3fs.cpython-312-x86_64-linux-gnu.so
#8  0x000055bdcca34c2c in atexit_callfuncs at Modules/atexitmodule.c:137
#9  0x000055bdcca22213 in _PyAtExit_Call at Modules/atexitmodule.c:157
#10 Py_FinalizeEx () at Python/pylifecycle.c:1927
#11 0x000055bdcca30920 in Py_RunMain () at Modules/main.c:716
#12 0x000055bdcc9ea477 in Py_BytesMain (...) at Modules/main.c:768
```

#### gdb backtrace: the AWS CRT event-loop thread

```
Thread 66 (Thread 0x7ff661ff0640 (LWP 3751) "AwsEventLoop1"):
#0  0x00007ff71531feae in epoll_wait (epfd=4, ..., timeout=100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007ff71153cf1a in aws_event_loop_thread () from pyarrow/libarrow.so.2400
#2  0x00007ff7115fe639 in thread_fn () from pyarrow/libarrow.so.2400
#3  0x00007ff71528eac3 in start_thread (...) at pthread_create.c:442
#4  0x00007ff71531fa84 in clone () at clone.S:100
```

The `epoll_wait(timeout=100000)` is the standard idle poll interval of `aws-c-io`'s Linux epoll event loop ([`DEFAULT_TIMEOUT = 100 * 1000`](https://github.com/awslabs/aws-c-io/blob/v0.26.3/source/linux/epoll_event_loop.c)). It is *not* the bug signal on its own.
The real bug signal is that nothing wrote to the loop's wake pipe/eventfd to ask it to stop, because the underlying `aws_client_bootstrap`'s C-side refcount never reached zero.

#### Root cause analysis

`~ClientBootstrap()` in `aws-crt-cpp` is:

```cpp
aws_client_bootstrap_release(m_bootstrap);
if (m_enableBlockingShutdown)
{
    // If your program is stuck here, stop using EnableBlockingShutdown()
    m_shutdownFuture.wait();
}
```

([aws-crt-cpp v0.38.0 source/io/Bootstrap.cpp](https://github.com/awslabs/aws-crt-cpp/blob/v0.38.0/source/io/Bootstrap.cpp))

`Aws::InitAPI()` (called by Arrow's `InitializeS3`) unconditionally calls `clientBootstrap->EnableBlockingShutdown()` ([aws-sdk-cpp 1.11.594 Aws.cpp:90](https://github.com/aws/aws-sdk-cpp/blob/1.11.594/src/aws-cpp-sdk-core/source/Aws.cpp#L90)), so the destructor always takes the blocking path. The libstdc++ frame in our stack is `std::future::wait()` on `m_shutdownFuture`, whose promise is fulfilled by the C layer's `on_shutdown_complete` callback. That callback fires only once the C `aws_client_bootstrap` reaches refcount zero.

The Python reference `s3` keeps `pyarrow._s3fs.S3FileSystem` alive → keeps the C++ `shared_ptr<S3Client>` alive → keeps the underlying `aws_s3_client` alive → which holds a strong reference on `aws_client_bootstrap`. The C++ wrapper destructor releases only *its own* single C-side ref, so when `Aws::CleanupCrt()` drops Arrow's default `shared_ptr<ClientBootstrap>`, the C bootstrap still has refcount > 0, `on_shutdown_complete` never fires, and the main thread futex-waits indefinitely. Meanwhile the event-loop thread sits at the top of its 100 s idle poll, never told to stop.

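The reference chain above can be modelled in a few lines of Python. This is a toy sketch, not the aws-crt-cpp implementation: `ToyBootstrap`, `ToyClient` and `blocking_shutdown` are hypothetical stand-ins, and the wait uses a timeout (which the real destructor does not) so that the demo terminates.

```python
import threading


class ToyBootstrap:
    """Stand-in for the C-side aws_client_bootstrap: refcount plus signal."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._refs = 1  # the single reference owned by the C++ wrapper
        self.shutdown_complete = threading.Event()

    def acquire(self) -> None:
        with self._lock:
            self._refs += 1

    def release(self) -> None:
        with self._lock:
            self._refs -= 1
            last = self._refs == 0
        if last:
            # Plays the role of on_shutdown_complete fulfilling the promise.
            self.shutdown_complete.set()


class ToyClient:
    """Stand-in for an S3 client holding a strong ref on the bootstrap."""

    def __init__(self, bootstrap: ToyBootstrap) -> None:
        self._bootstrap = bootstrap
        bootstrap.acquire()

    def close(self) -> None:
        self._bootstrap.release()


def blocking_shutdown(bootstrap: ToyBootstrap, timeout_s: float) -> bool:
    """Mimic the destructor: drop the wrapper's ref, then wait.

    Returns True if shutdown completed within timeout_s.
    """
    bootstrap.release()
    return bootstrap.shutdown_complete.wait(timeout_s)


# Client still alive at shutdown: the wait cannot complete.
live = ToyBootstrap()
client = ToyClient(live)
hung = not blocking_shutdown(live, 0.2)

# Client closed first (the `del s3` case): shutdown completes.
clean = ToyBootstrap()
ToyClient(clean).close()
completed = blocking_shutdown(clean, 0.2)

print(hung, completed)  # True True
```

The first case corresponds to `bug.py` and `still_hangs.py`, the second to `fixed.py`: the only difference is whether the client's reference is released before the wrapper's.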
#### Why 23.0.0 worked and 24.0.0 doesn't

Arrow's `s3fs.cc` finalize code is essentially identical between 23.0.0 and 24.0.0; the change is in the bundled AWS C/C++ stack ([`cpp/thirdparty/versions.txt`](https://github.com/apache/arrow/blob/apache-arrow-24.0.0/cpp/thirdparty/versions.txt)):

| component | 23.0.0 | 24.0.0 |
|---|---|---|
| aws-crt-cpp | 0.32.8 | **0.38.0** |
| aws-c-io | 0.19.1 | **0.26.3** |
| aws-c-s3 | 0.8.1 | **0.12.0** |

The `EnableBlockingShutdown` + `m_shutdownFuture.wait()` pattern itself has not changed across these versions, but the ref-graph / teardown-task scheduling in `aws-c-io` between 0.19 and 0.26 changed enough that a previously-fast-but-still-incorrect shutdown now reliably deadlocks when any `S3Client` reference is alive.

#### Suggested fix directions

1. **Arrow side (preferred):** at the top of `arrow::fs::FinalizeS3` (before `Aws::ShutdownAPI`), drop Arrow's internal `S3Client` cache (`S3ClientFinalizer`) and ensure no `S3FileSystem`-owned `shared_ptr<S3Client>` survives. Today the finalizer does call into `S3ClientFinalizer::Finalize()`, but any `S3FileSystem` instance still reachable from Python keeps `S3ClientHolder` alive past that point. Either:
   - hold `S3Client` via `weak_ptr` inside `S3FileSystem` and re-resolve through the finalizer, or
   - have the atexit handler eagerly walk known `S3FileSystem` instances (via a `weak_ptr` registry) and reset their clients before calling `ShutdownAPI`.
2. **Bypass the blocking destructor:** consider whether Arrow needs `Aws::InitAPI` defaults at all. Building `SDKOptions` with a custom `ClientBootstrap` that doesn't call `EnableBlockingShutdown()` would convert this from a hang into a benign leak (matching the warning in aws-crt-cpp's own source).
3. **At minimum, document the footgun.** Today there is no docstring warning that holding a `pyarrow.fs.S3FileSystem` past interpreter shutdown will hang the process on `pyarrow>=24`.

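To illustrate the weak-registry variant of fix direction 1, here is a toy Python sketch. The real change would live in Arrow's C++ code; `ToyFileSystem`, `ToyClient` and `finalize` are hypothetical names and do not correspond to existing Arrow or pyarrow APIs.

```python
import weakref


class ToyClient:
    """Stand-in for the S3 client owned by a filesystem object."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class ToyFileSystem:
    """Stand-in for S3FileSystem; registers itself on construction."""

    # Weak registry: tracks live instances without keeping them alive.
    _live = weakref.WeakSet()

    def __init__(self) -> None:
        self.client = ToyClient()
        ToyFileSystem._live.add(self)

    def _reset_client(self) -> None:
        if self.client is not None:
            self.client.close()
            self.client = None


def finalize(shutdown_api) -> int:
    """Reset every still-reachable filesystem's client, then shut down.

    `shutdown_api` is a placeholder callable for the SDK shutdown step.
    Returns the number of filesystems that were visited.
    """
    # Snapshot into a list so instances stay alive while we iterate.
    survivors = list(ToyFileSystem._live)
    for fs in survivors:
        fs._reset_client()
    shutdown_api()
    return len(survivors)
```

With this shape, a filesystem that is still referenced at exit has its client released before the shutdown call runs, so the shutdown no longer depends on user code dropping its reference first. The surviving object would need to fail cleanly on later use, which this sketch does not model.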
#### Related

- [aws/aws-sdk-cpp#2769: `ShutdownAPI` hangs when client outlives `ShutdownAPI`](https://github.com/aws/aws-sdk-cpp/issues/2769) (closed; identical stack signature, root cause documented)
- [aws-crt-cpp v0.38.0 `~ClientBootstrap` source comment](https://github.com/awslabs/aws-crt-cpp/blob/v0.38.0/source/io/Bootstrap.cpp): "If your program is stuck here, stop using EnableBlockingShutdown()"
- [Arrow PR #38375 (GH-38364)](https://github.com/apache/arrow/pull/38375): atexit registration for `ensure_s3_finalized` (the path that fires the deadlock)

### Component(s)

Python, C++

---

*Note: I captured the gdb dump and verified the reproducer (including the `del s3` vs `finalize_s3()` asymmetry) on a real environment; the write-up and cross-references to aws-sdk-cpp / aws-crt-cpp source were drafted with help from Claude. Happy to provide the full gdb log, a Dockerfile repro, or additional traces on request.*
