On Mon, 17 Nov 2025 at 12:33, Tomasz Kaminski <[email protected]> wrote:
>
> On Mon, Nov 17, 2025 at 10:10 AM Tomasz Kaminski <[email protected]> wrote:
>>
>> On Sun, Nov 16, 2025 at 1:56 AM Jonathan Wakely <[email protected]> wrote:
>>>
>>> This will allow us to extend atomic waiting functions to support a
>>> possible future 64-bit version of futex, as well as supporting
>>> futex-like wait/wake primitives on other targets (e.g. macOS has
>>> os_sync_wait_on_address and FreeBSD has _umtx_op).
>>>
>>> Before this change, the decision of whether to do a proxy wait or to
>>> wait on the atomic variable itself was made in the header at
>>> compile-time, which makes it an ABI property that would not have been
>>> possible to change later. That would have meant that
>>> std::atomic<uint64_t> would always have to do a proxy wait even if Linux
>>> gains support for 64-bit futex2(2) calls at some point in the future.
>>> The disadvantage of proxy waits is that several distinct atomic objects
>>> can share the same proxy state, leading to contention between threads
>>> even when they are not waiting on the same atomic object, similar to
>>> false sharing. It also results in spurious wake-ups because doing a
>>> notify on an atomic object that uses a proxy wait will wake all waiters
>>> sharing the proxy.
>>>
>>> For types that are known to definitely not need a proxy wait (e.g. int
>>> on Linux) the header can still choose a more efficient path at
>>> compile-time. But for other types, the decision of whether to do a proxy
>>> wait is deferred to runtime, inside the library internals. This will
>>> make it possible for future versions of libstdc++.so to extend the set
>>> of types which don't need to use proxy waits, without ABI changes.
>>>
>>> The way the change works is to stop using the __proxy_wait flag that was
>>> set by the inline code in the headers. Instead the __wait_args struct
>>> has an extra pointer member which the library internals populate with
>>> either the address of the atomic object or the _M_ver counter in the
>>> proxy state. There is also a new _M_obj_size member which stores the
>>> size of the atomic object, so that the library can decide whether a
>>> proxy is needed. So for example if Linux gains 64-bit futex support then
>>> the library can decide not to use a proxy when _M_obj_size == 8.
>>> Finally, the _M_old member of the __wait_args struct is changed to
>>> uint64_t so that it has room to store 64-bit values, not just whatever
>>> size the __platform_wait_t type is (which is a 32-bit int on Linux).
>>> Similarly, the _M_val member of __wait_result_type changes to uint64_t
>>> too.
>>>
>>> libstdc++-v3/ChangeLog:
>>>
>>>         * config/abi/pre/gnu.ver: Adjust exports.
>>>         * include/bits/atomic_timed_wait.h
>>>         (_GLIBCXX_HAVE_PLATFORM_TIMED_WAIT):
>>>         Do not define this macro.
>>>         (__atomic_wait_address_until_v, __atomic_wait_address_for_v):
>>>         Guard assertions with #ifdef _GLIBCXX_UNKNOWN_PLATFORM_WAIT.
>>>         * include/bits/atomic_wait.h (__platform_wait_uses_type):
>>>         Define separately for platforms with and without platform
>>>         wait.
>>>         (_GLIBCXX_HAVE_PLATFORM_WAIT): Do not define this macro.
>>>         (_GLIBCXX_UNKNOWN_PLATFORM_WAIT): Define new macro.
>>>         (__wait_value_type): New typedef.
>>>         (__wait_result_type): Change _M_val to __wait_value_type.
>>>         (__wait_args_base::_M_old): Change to __wait_args_base.
>>>         (__wait_args_base::_M_obj, __wait_args_base::_M_obj_size): New
>>>         data members.
>>>         (__wait_args::__wait_args): Set _M_obj and _M_obj_size on
>>>         construction.
>>>         (__wait_args::_M_setup_wait): Change void* parameter to deduced
>>>         type. Use _S_bit_cast instead of __builtin_bit_cast.
>>>         (__wait_args::_M_load_proxy_wait_val): Remove function, replace
>>>         with ...
>>>         (__wait_args::_M_setup_wait_impl): New function.
>>>         (__wait_args::_S_bit_cast): Wrapper for __builtin_bit_cast which
>>>         also supports conversion from 32-bit values.
>>>         (__wait_args::_S_flags_for): Do not set __proxy_wait flag.
>>>         (__atomic_wait_address_v): Guard assertions with #ifdef
>>>         _GLIBCXX_UNKNOWN_PLATFORM_WAIT.
>>>         * src/c++20/atomic.cc (_GLIBCXX_HAVE_PLATFORM_WAIT): Define here
>>>         instead of in header. Check _GLIBCXX_HAVE_PLATFORM_WAIT instead
>>>         of _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT.
>>>         (__spin_impl): Adjust for 64-bit __wait_args_base::_M_old.
>>>         (use_proxy_wait): New function.
>>>         (__wait_args::_M_load_proxy_wait_val): Replace with ...
>>>         (__wait_args::_M_setup_wait_impl): New function. Call
>>>         use_proxy_wait to decide at runtime whether to wait on the
>>>         pointer directly instead of using a proxy. If a proxy is needed,
>>>         set _M_obj to point to its _M_ver member. Adjust for change to
>>>         type of _M_old.
>>>         (__wait_impl): Wait on _M_obj unconditionally.
>>>         (__notify_impl): Call use_proxy_wait to decide whether to notify
>>>         on the address parameter or a proxy.
>>>         (__spin_until_impl): Adjust for change to type of _M_val.
>>>         (__wait_until_impl): Wait on _M_obj unconditionally.
>>> ---
>>>
>>> Tested x86_64-linux, powerpc64le-linux, sparc-solaris.
>>
>> A lot of comments below.
>>>
>>> I think this is an important change which I unfortunately didn't think of
>>> until recently.
>>>
>>> This changes the exports from the shared library, but we're still in
>>> stage 1 so I think that should be allowed (albeit unfortunate). Nobody
>>> should be expecting GCC 16 to be stable yet.
>>>
>>> The __proxy_wait enumerator is now unused and could be removed. The
>>> __abi_version enumerator could also be bumped to indicate the
>>> incompatibility with earlier snapshots of GCC 16, but I don't think that
>>> is needed. We could in theory keep the old symbol export
>>> (__wait_args::_M_load_proxy_wait) and make it trap/abort if called, but
>>> I'd prefer to just remove it and cause dynamic linker errors instead.
>>>
>>> There's a TODO in the header about which types should be allowed to take
>>> the optimized paths (see the __waitable concept). For types where that's
>>> true, if the size matches a futex then we'll use a futex, even if it's
>>> actually an enum or floating-point type (or pointer on 32-bit targets).
>>> I'm not sure if that's safe.
>>>
>>>
>>>  libstdc++-v3/config/abi/pre/gnu.ver           |   3 +-
>>>  libstdc++-v3/include/bits/atomic_timed_wait.h |  12 +-
>>>  libstdc++-v3/include/bits/atomic_wait.h       | 109 +++++++++-----
>>>  libstdc++-v3/src/c++20/atomic.cc              | 140 +++++++++++-------
>>>  4 files changed, 166 insertions(+), 98 deletions(-)
>>>
>>> diff --git a/libstdc++-v3/config/abi/pre/gnu.ver b/libstdc++-v3/config/abi/pre/gnu.ver
>>> index 2e48241d51f9..3c2bd4921730 100644
>>> --- a/libstdc++-v3/config/abi/pre/gnu.ver
>>> +++ b/libstdc++-v3/config/abi/pre/gnu.ver
>>> @@ -2553,7 +2553,8 @@ GLIBCXX_3.4.35 {
>>>      _ZNSt8__detail11__wait_implEPKvRNS_16__wait_args_baseE;
>>>      _ZNSt8__detail13__notify_implEPKvbRKNS_16__wait_args_baseE;
>>>      _ZNSt8__detail17__wait_until_implEPKvRNS_16__wait_args_baseERKNSt6chrono8durationI[lx]St5ratioIL[lx]1EL[lx]1000000000EEEE;
>>> -    _ZNSt8__detail11__wait_args22_M_load_proxy_wait_valEPKv;
>>> +    _ZNSt8__detail11__wait_args18_M_setup_wait_implEPKv;
>>> +    _ZNSt8__detail11__wait_args20_M_setup_notify_implEPKv;
>>>
>>>      # std::chrono::gps_clock::now, tai_clock::now
>>>      _ZNSt6chrono9gps_clock3nowEv;
>>> diff --git a/libstdc++-v3/include/bits/atomic_timed_wait.h b/libstdc++-v3/include/bits/atomic_timed_wait.h
>>> index 30f7ff616840..918a267d10eb 100644
>>> --- a/libstdc++-v3/include/bits/atomic_timed_wait.h
>>> +++ b/libstdc++-v3/include/bits/atomic_timed_wait.h
>>> @@ -75,14 +75,6 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>>        return chrono::ceil<__w_dur>(__atime);
>>>      }
>>>
>>> -#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
>>> -#define _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
>>> -#else
>>> -// define _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT and implement __platform_wait_until
>>> -// if there is a more efficient primitive supported by the platform
>>> -// (e.g. __ulock_wait) which is better than pthread_cond_clockwait.
>>> -#endif // ! HAVE_LINUX_FUTEX
>>> -
>>>      __wait_result_type
>>>      __wait_until_impl(const void* __addr, __wait_args_base& __args,
>>>                        const __wait_clock_t::duration& __atime);
>>> @@ -156,7 +148,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>>                      const chrono::time_point<_Clock, _Dur>& __atime,
>>>                      bool __bare_wait = false) noexcept
>>>      {
>>> -#ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
>>> +#ifdef _GLIBCXX_UNKNOWN_PLATFORM_WAIT
>>>        __glibcxx_assert(false); // This function can't be used for proxy wait.
>>>  #endif
>>>        __detail::__wait_args __args{ __addr, __old, __order, __bare_wait };
>>> @@ -208,7 +200,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>>                      const chrono::duration<_Rep, _Period>& __rtime,
>>>                      bool __bare_wait = false) noexcept
>>>      {
>>> -#ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
>>> +#ifdef _GLIBCXX_UNKNOWN_PLATFORM_WAIT
>>
>> This name really reads strange, and sounds like something with "TODO".
>> I think _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT was just OK name, even
>> if it was not used directly.
>>>
>>>        __glibcxx_assert(false); // This function can't be used for proxy wait.
>>>  #endif
>>>        __detail::__wait_args __args{ __addr, __old, __order, __bare_wait };
>>> diff --git a/libstdc++-v3/include/bits/atomic_wait.h b/libstdc++-v3/include/bits/atomic_wait.h
>>> index 95151479c120..49369419d6a6 100644
>>> --- a/libstdc++-v3/include/bits/atomic_wait.h
>>> +++ b/libstdc++-v3/include/bits/atomic_wait.h
>>> @@ -45,35 +45,34 @@
>>>  namespace std _GLIBCXX_VISIBILITY(default)
>>>  {
>>>  _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>> +#if defined _GLIBCXX_HAVE_LINUX_FUTEX
>>>    namespace __detail
>>>    {
>>> -#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
>>> -#define _GLIBCXX_HAVE_PLATFORM_WAIT 1
>>>      using __platform_wait_t = int;
>>>      inline constexpr size_t __platform_wait_alignment = 4;
>>> +  }
>>> +  template<typename _Tp>
>>> +    inline constexpr bool __platform_wait_uses_type
>>> +      = is_scalar_v<_Tp> && sizeof(_Tp) == sizeof(int) && alignof(_Tp) >= 4;
>>>  #else
>>> +# define _GLIBCXX_UNKNOWN_PLATFORM_WAIT 1
>>>  // define _GLIBCX_HAVE_PLATFORM_WAIT and implement __platform_wait()
>>>  // and __platform_notify() if there is a more efficient primitive supported
>>>  // by the platform (e.g. __ulock_wait()/__ulock_wake()) which is better than
>>>  // a mutex/condvar based wait.
>>> +  namespace __detail
>>> +  {
>>>  # if ATOMIC_LONG_LOCK_FREE == 2
>>>      using __platform_wait_t = unsigned long;
>>>  # else
>>>      using __platform_wait_t = unsigned int;
>>>  # endif
>>>      inline constexpr size_t __platform_wait_alignment
>>> -      = __alignof__(__platform_wait_t);
>>> -#endif
>>> +      = sizeof(__platform_wait_t) < __alignof__(__platform_wait_t)
>>> +      ? __alignof__(__platform_wait_t) : sizeof(__platform_wait_t);
>>>    } // namespace __detail
>>> -
>>> -  template<typename _Tp>
>>> -    inline constexpr bool __platform_wait_uses_type
>>> -#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT
>>> -      = is_scalar_v<_Tp>
>>> -      && ((sizeof(_Tp) == sizeof(__detail::__platform_wait_t))
>>> -      && (alignof(_Tp) >= __detail::__platform_wait_alignment));
>>> -#else
>>> -      = false;
>>> +  template<typename>
>>> +    inline constexpr bool __platform_wait_uses_type = false;
>>>  #endif
>>>
>>>    namespace __detail
>>> @@ -105,10 +104,19 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>>        return __builtin_memcmp(&__a, &__b, sizeof(_Tp)) == 0;
>>>      }
>>>
>>> -    // lightweight std::optional<__platform_wait_t>
>>> +    // TODO: this needs to be false for types with padding, e.g. __int20.
>>
>> I do not understand why this needs to be required. This function is used
>> only via atomic or atomic_ref. For atomic, we can guarantee that the type
>> has padding bytes cleared.
>>>
>>> +    // TODO: should this be true only for integral, enum, and pointer types?
>>
>> What I think is missing here is alignment. I assume that any platform wait
>> may use bits that are clear due to the alignment of platform wait type for
>> some internal state.
>> Or we are going to check is_sufficiently_aligment in cc file, and use
>> different kind of wait depending on the object?
>>
>> But I think, we can later safely extend or change what is waitable (except
>> extending it past 8 bytes), as if we start putting _M_obj_size to non zero,
>> impl may use platform wait.
>
> After pondering a bit on this, I realized that this requirement will be set
> in stone after shipping, because the old TU calling wait needs to agree with
> the new TU calling notify, or vice versa.
> I.e. the set of types that are waitable needs to be the same on both sides.

Yes. So if we exclude float and double today, we exclude them forever.

I think that's OK. I think it's OK for atomic wait/notify to be suboptimal for non-integral types. It would be preferable if it's optimal for all integral types that the OS can support (and this patch tries to ensure that we aren't restricted to the types that the OS supports *today* but can evolve if the OS supports more types in future, or if somebody implements __platform_wait and __platform_timed_wait and __platform_notify for a new OS in future).

I think it would be nice if atomic wait/notify on enums was optimal, because fundamentally they're just integers, and so atomic<enum32> and atomic<int> can have equal performance.

I don't think atomic<float> needs to be optimal, nobody really expects float to behave exactly like int just because they have the same size.
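
To make that split concrete, here is a rough C++17 sketch of the idea. It is not the code in the patch. Every name in the `sketch` namespace is hypothetical, and so are `enum32` and the `have_futex64` parameter, which stands in for whatever the library would really check. The compile-time trait is the part that gets baked into every TU that includes the header, so it can never change after we ship. The size-based decision is the part that could live in libstdc++.so and grow over time.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Illustrative only: hypothetical names, not libstdc++'s __waitable concept
// or its use_proxy_wait function.
namespace sketch
{
  // Compile-time part, fixed forever once shipped: only integral and
  // enumeration types of at most 8 bytes are candidates for waiting
  // directly on the object. float, double and class types are excluded.
  template<typename T>
    inline constexpr bool may_wait_directly
      = (std::is_integral_v<T> || std::is_enum_v<T>)
          && sizeof(T) <= sizeof(std::uint64_t);

  // What the inline header code would record for the library:
  // the object size for candidates, 0 meaning "always use a proxy".
  template<typename T>
    constexpr std::size_t
    obj_size_for() noexcept
    { return may_wait_directly<T> ? sizeof(T) : 0; }

  // Run-time part, free to evolve inside the shared library. The
  // have_futex64 flag is an assumption modelling a future OS capability.
  constexpr bool
  use_proxy_wait(std::size_t obj_size, bool have_futex64 = false) noexcept
  {
    if (obj_size == sizeof(std::uint32_t))
      return false;          // matches a 32-bit futex word today
    if (obj_size == sizeof(std::uint64_t))
      return !have_futex64;  // direct wait only if the OS supports it
    return true;             // 0 (not a candidate) or an unsupported size
  }

  // A 32-bit enum, standing in for the "enum32" case.
  enum class enum32 : std::int32_t { idle, busy };

  static_assert(may_wait_directly<std::int32_t>);
  static_assert(may_wait_directly<enum32>);
  static_assert(!may_wait_directly<float>);
}
```

With this shape, `enum32` and `std::int32_t` take the same path, `float` always goes through a proxy, and an 8-byte integer can stop using a proxy later without recompiling callers, because only `use_proxy_wait` changes.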
