On Mon, 17 Nov 2025 at 10:10 +0100, Tomasz Kaminski wrote:
On Sun, Nov 16, 2025 at 1:56 AM Jonathan Wakely <[email protected]> wrote:

This will allow us to extend atomic waiting functions to support a
possible future 64-bit version of futex, as well as supporting
futex-like wait/wake primitives on other targets (e.g. macOS has
os_sync_wait_on_address and FreeBSD has _umtx_op).

Before this change, the decision of whether to do a proxy wait or to
wait on the atomic variable itself was made in the header at
compile-time, which makes it an ABI property that would not have been
possible to change later.  That would have meant that
std::atomic<uint64_t> would always have to do a proxy wait even if Linux
gains support for 64-bit futex2(2) calls at some point in the future.
The disadvantage of proxy waits is that several distinct atomic objects
can share the same proxy state, leading to contention between threads
even when they are not waiting on the same atomic object, similar to
false sharing. It also results in spurious wake-ups because doing a
notify on an atomic object that uses a proxy wait will wake all waiters
sharing the proxy.
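
For illustration (a minimal sketch, not part of the patch; names and the
table size are assumed, the real lookup lives in src/c++20/atomic.cc):
proxy states are found by hashing the object's address into a small
fixed-size table, so unrelated atomics can map to the same entry.

  #include <cstdint>

  struct proxy_state { unsigned ver = 0; /* plus waiter count, mutex, condvar */ };

  proxy_state&
  state_for(const void* addr)
  {
    static proxy_state table[16];                     // illustrative size
    auto key = reinterpret_cast<std::uintptr_t>(addr);
    return table[(key >> 2) % 16]; // two distinct atomics can collide here,
                                   // so a notify on one wakes the other's waiters
  }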

For types that are known to definitely not need a proxy wait (e.g. int
on Linux) the header can still choose a more efficient path at
compile-time. But for other types, the decision of whether to do a proxy
wait is deferred to runtime, inside the library internals. This will
make it possible for future versions of libstdc++.so to extend the set
of types which don't need to use proxy waits, without ABI changes.

The way the change works is to stop using the __proxy_wait flag that was
set by the inline code in the headers. Instead the __wait_args struct
has an extra pointer member which the library internals populate with
either the address of the atomic object or the _M_ver counter in the
proxy state. There is also a new _M_obj_size member which stores the
size of the atomic object, so that the library can decide whether a
proxy is needed. So, for example, if Linux gains 64-bit futex support then
the library can decide not to use a proxy when _M_obj_size == 8.
Finally, the _M_old member of the __wait_args struct is changed to
uint64_t so that it has room to store 64-bit values, not just whatever
size the __platform_wait_t type is (which is a 32-bit int on Linux).
Similarly, the _M_val member of __wait_result_type changes to uint64_t
too.
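
Condensed from the diff below, the struct now looks like this:

  struct __wait_args_base
  {
    __wait_flags _M_flags;
    int _M_order = __ATOMIC_ACQUIRE;
    __UINT64_TYPE__ _M_old{};      // room for a future 64-bit futex value
    void* _M_wait_state = nullptr;
    const void* _M_obj = nullptr;  // what __wait_impl actually waits on:
                                   // the atomic object or the proxy's _M_ver
    unsigned char _M_obj_size = 0; // lets the library pick proxy vs direct
  };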

libstdc++-v3/ChangeLog:

        * config/abi/pre/gnu.ver: Adjust exports.
        * include/bits/atomic_timed_wait.h
        (_GLIBCXX_HAVE_PLATFORM_TIMED_WAIT): Do not define this macro.
        (__atomic_wait_address_until_v, __atomic_wait_address_for_v):
        Guard assertions with #ifdef _GLIBCXX_UNKNOWN_PLATFORM_WAIT.
        * include/bits/atomic_wait.h (__platform_wait_uses_type):
        Define separately for platforms with and without platform
        wait.
        (_GLIBCXX_HAVE_PLATFORM_WAIT): Do not define this macro.
        (_GLIBCXX_UNKNOWN_PLATFORM_WAIT): Define new macro.
        (__wait_value_type): New typedef.
        (__wait_result_type): Change _M_val to __wait_value_type.
        (__wait_args_base::_M_old): Change type to __wait_value_type.
        (__wait_args_base::_M_obj, __wait_args_base::_M_obj_size): New
        data members.
        (__wait_args::__wait_args): Set _M_obj and _M_obj_size on
        construction.
        (__wait_args::_M_setup_wait): Change void* parameter to deduced
        type. Use _S_bit_cast instead of __builtin_bit_cast.
        (__wait_args::_M_load_proxy_wait_val): Remove function, replace
        with ...
        (__wait_args::_M_setup_wait_impl): New function.
        (__wait_args::_S_bit_cast): Wrapper for __builtin_bit_cast which
        also supports conversion from 32-bit values.
        (__wait_args::_S_flags_for): Do not set __proxy_wait flag.
        (__atomic_wait_address_v): Guard assertions with #ifdef
        _GLIBCXX_UNKNOWN_PLATFORM_WAIT.
        * src/c++20/atomic.cc (_GLIBCXX_HAVE_PLATFORM_WAIT): Define here
        instead of in header. Check _GLIBCXX_HAVE_PLATFORM_WAIT instead
        of _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT.
        (__spin_impl): Adjust for 64-bit __wait_args_base::_M_old.
        (use_proxy_wait): New function.
        (__wait_args::_M_load_proxy_wait_val): Replace with ...
        (__wait_args::_M_setup_wait_impl): New function. Call
        use_proxy_wait to decide at runtime whether to wait on the
        pointer directly instead of using a proxy. If a proxy is needed,
        set _M_obj to point to its _M_ver member. Adjust for change to
        type of _M_old.
        (__wait_impl): Wait on _M_obj unconditionally.
        (__notify_impl): Call use_proxy_wait to decide whether to notify
        on the address parameter or a proxy.
        (__spin_until_impl): Adjust for change to type of _M_val.
        (__wait_until_impl): Wait on _M_obj unconditionally.
---

Tested x86_64-linux, powerpc64le-linux, sparc-solaris.

A lot of comments below.


I think this is an important change which I unfortunately didn't think of
until recently.

This changes the exports from the shared library, but we're still in
stage 1 so I think that should be allowed (albeit unfortunate). Nobody
should be expecting GCC 16 to be stable yet.

The __proxy_wait enumerator is now unused and could be removed. The
__abi_version enumerator could also be bumped to indicate the
incompatibility with earlier snapshots of GCC 16, but I don't think that
is needed. We could in theory keep the old symbol export
(__wait_args::_M_load_proxy_wait_val) and make it trap/abort if called, but
I'd prefer to just remove it and cause dynamic linker errors instead.

There's a TODO in the header about which types should be allowed to take
the optimized paths (see the __waitable concept). For types where that's
true, if the size matches a futex then we'll use a futex, even if it's
actually an enum or floating-point type (or pointer on 32-bit targets).
I'm not sure if that's safe.
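
One possible tightening (just an illustration of that TODO, not what this
patch does) would be to require integral, enum, or pointer types without
padding bits:

  template<typename _Tp>
    concept __waitable
      = (is_integral_v<_Tp> || is_enum_v<_Tp> || is_pointer_v<_Tp>)
        && has_unique_object_representations_v<_Tp> // excludes e.g. __int20
        && sizeof(_Tp) <= sizeof(__UINT64_TYPE__);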


 libstdc++-v3/config/abi/pre/gnu.ver           |   3 +-
 libstdc++-v3/include/bits/atomic_timed_wait.h |  12 +-
 libstdc++-v3/include/bits/atomic_wait.h       | 109 +++++++++-----
 libstdc++-v3/src/c++20/atomic.cc              | 140 +++++++++++-------
 4 files changed, 166 insertions(+), 98 deletions(-)

diff --git a/libstdc++-v3/config/abi/pre/gnu.ver b/libstdc++-v3/config/abi/pre/gnu.ver
index 2e48241d51f9..3c2bd4921730 100644
--- a/libstdc++-v3/config/abi/pre/gnu.ver
+++ b/libstdc++-v3/config/abi/pre/gnu.ver
@@ -2553,7 +2553,8 @@ GLIBCXX_3.4.35 {
     _ZNSt8__detail11__wait_implEPKvRNS_16__wait_args_baseE;
    _ZNSt8__detail13__notify_implEPKvbRKNS_16__wait_args_baseE;
    _ZNSt8__detail17__wait_until_implEPKvRNS_16__wait_args_baseERKNSt6chrono8durationI[lx]St5ratioIL[lx]1EL[lx]1000000000EEEE;
-    _ZNSt8__detail11__wait_args22_M_load_proxy_wait_valEPKv;
+    _ZNSt8__detail11__wait_args18_M_setup_wait_implEPKv;
+    _ZNSt8__detail11__wait_args20_M_setup_notify_implEPKv;

     # std::chrono::gps_clock::now, tai_clock::now
     _ZNSt6chrono9gps_clock3nowEv;
diff --git a/libstdc++-v3/include/bits/atomic_timed_wait.h b/libstdc++-v3/include/bits/atomic_timed_wait.h
index 30f7ff616840..918a267d10eb 100644
--- a/libstdc++-v3/include/bits/atomic_timed_wait.h
+++ b/libstdc++-v3/include/bits/atomic_timed_wait.h
@@ -75,14 +75,6 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
          return chrono::ceil<__w_dur>(__atime);
       }

-#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
-#define _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
-#else
-// define _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT and implement __platform_wait_until
-// if there is a more efficient primitive supported by the platform
-// (e.g. __ulock_wait) which is better than pthread_cond_clockwait.
-#endif // ! HAVE_LINUX_FUTEX
-
     __wait_result_type
     __wait_until_impl(const void* __addr, __wait_args_base& __args,
                      const __wait_clock_t::duration& __atime);
@@ -156,7 +148,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
                                  const chrono::time_point<_Clock, _Dur>& __atime,
                                  bool __bare_wait = false) noexcept
     {
-#ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
+#ifdef _GLIBCXX_UNKNOWN_PLATFORM_WAIT
       __glibcxx_assert(false); // This function can't be used for proxy wait.
 #endif
       __detail::__wait_args __args{ __addr, __old, __order, __bare_wait };
@@ -208,7 +200,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
                                const chrono::duration<_Rep, _Period>& __rtime,
                                bool __bare_wait = false) noexcept
     {
-#ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
+#ifdef _GLIBCXX_UNKNOWN_PLATFORM_WAIT

This name reads really strangely, and sounds like something with a "TODO".
I think _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT was just an OK name, even
if it was not used directly.

The rationale for the new name is that when the header gets
preprocessed we don't yet know if the libstdc++ shared lib that will
be used at runtime has a platform wait function available. A
translation unit might get compiled with GCC 16 and so there is no
known platform wait for macOS, but then at runtime it might link to
the libstdc++.dylib from GCC 17 which uses ulock_wait.

And the reason for getting rid of HAVE_PLATFORM_TIMED_WAIT is twofold:
- I don't think it makes sense to have separate HAVE_PLATFORM_WAIT and
  HAVE_PLATFORM_TIMED_WAIT macros. I doubt any target is going to
  support a futex-like operation that doesn't support a timeout (and
  if there is such an operation, we should just not use it at all).
- The macro saying whether we have a platform wait operation is now
  only defined inside the library, not in the header. It's not
  something that the header needs to know (or can know).

But I will try to improve the name of UNKNOWN_PLATFORM_WAIT.

       __glibcxx_assert(false); // This function can't be used for proxy wait.
 #endif
       __detail::__wait_args __args{ __addr, __old, __order, __bare_wait };
diff --git a/libstdc++-v3/include/bits/atomic_wait.h b/libstdc++-v3/include/bits/atomic_wait.h
index 95151479c120..49369419d6a6 100644
--- a/libstdc++-v3/include/bits/atomic_wait.h
+++ b/libstdc++-v3/include/bits/atomic_wait.h
@@ -45,35 +45,34 @@
 namespace std _GLIBCXX_VISIBILITY(default)
 {
 _GLIBCXX_BEGIN_NAMESPACE_VERSION
+#if defined _GLIBCXX_HAVE_LINUX_FUTEX
   namespace __detail
   {
-#ifdef _GLIBCXX_HAVE_LINUX_FUTEX
-#define _GLIBCXX_HAVE_PLATFORM_WAIT 1
     using __platform_wait_t = int;
     inline constexpr size_t __platform_wait_alignment = 4;
+  }
+  template<typename _Tp>
+    inline constexpr bool __platform_wait_uses_type
+      = is_scalar_v<_Tp> && sizeof(_Tp) == sizeof(int) && alignof(_Tp) >= 4;
 #else
+# define _GLIBCXX_UNKNOWN_PLATFORM_WAIT 1
 // define _GLIBCX_HAVE_PLATFORM_WAIT and implement __platform_wait()
 // and __platform_notify() if there is a more efficient primitive supported
 // by the platform (e.g. __ulock_wait()/__ulock_wake()) which is better than
 // a mutex/condvar based wait.
+  namespace __detail
+  {
 # if ATOMIC_LONG_LOCK_FREE == 2
     using __platform_wait_t = unsigned long;
 # else
     using __platform_wait_t = unsigned int;
 # endif
     inline constexpr size_t __platform_wait_alignment
-      = __alignof__(__platform_wait_t);
-#endif
+      = sizeof(__platform_wait_t) < __alignof__(__platform_wait_t)
+         ? __alignof__(__platform_wait_t) : sizeof(__platform_wait_t);
   } // namespace __detail
-
-  template<typename _Tp>
-    inline constexpr bool __platform_wait_uses_type
-#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT
-      = is_scalar_v<_Tp>
-       && ((sizeof(_Tp) == sizeof(__detail::__platform_wait_t))
-       && (alignof(_Tp) >= __detail::__platform_wait_alignment));
-#else
-      = false;
+  template<typename>
+    inline constexpr bool __platform_wait_uses_type = false;
 #endif

   namespace __detail
@@ -105,10 +104,19 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
        return __builtin_memcmp(&__a, &__b, sizeof(_Tp)) == 0;
       }

-    // lightweight std::optional<__platform_wait_t>
+    // TODO: this needs to be false for types with padding, e.g. __int20.

I do not understand why this needs to be required. This function is used
only via atomic or atomic_ref. For atomic, we can guarantee that the type
has its padding bytes cleared.

When we pass the "old" value to the kernel, I don't know if we can
guarantee that cleared padding bits remain cleared. We can be sure
that the value inside the std::atomic has padding bits cleared to
zero, but the std::atomic<T>::wait(T) function takes its argument by
value, and also passes that to the kernel by value.

I think it's safer to just do proxy waits for types with padding.
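
A hypothetical type to illustrate the hazard:

  struct Padded { char c; /* one byte of padding */ short s; };
  std::atomic<Padded> a{{'x', 1}};
  // The padding byte inside 'a' is cleared, but the argument below is a
  // fresh copy whose padding byte has an unspecified value, so a bytewise
  // comparison in the kernel could disagree about equality:
  // a.wait(Padded{'x', 1});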

+    // TODO: should this be true only for integral, enum, and pointer types?

What I think is missing here is alignment. I assume that any platform wait
may use bits that are clear due to the alignment of the platform wait type
for some internal state.

I don't understand what you mean here. The platform wait type is just
an integer, so all bits are part of the value representation (if we
ignore weird types like __int20 on msp430). A pointer to that type
might have "unused" low bits due to alignment, but we form that
pointer by taking the address of the atomic object, and we pass it to
the library, and we pass it to the OS wake/wait operation. I don't
think there's any opportunity for anything to ever set (or read) bits
for internal state.

Or are we going to check whether it is sufficiently aligned in the .cc file,
and use a different kind of wait depending on the object?

Yes, we can check the alignment of the __addr pointer when deciding
whether to use the platform wait operation. But the objects that
__addr points to are always the value inside a std::atomic or the
object that std::atomic_ref::_M_ptr points to, and we already know
that those objects are suitably aligned for atomic ops. It's possible
that some platform wait has stricter alignment requirements (e.g. you
can do atomic add/subtract if it's 4-byte aligned but you can't
wait/notify unless it's 16-byte aligned) but I don't know of any cases
where that's true (for macOS the object only needs to be aligned to
its size, which is already guaranteed by std::atomic and required by
std::atomic_ref, for FreeBSD it must be aligned to "ABI-mandated
alignment" which isn't very clear to me, but I don't think it means
"aligned more than the type usually requires").

If we want to use some new wait/wake primitive in future and it
requires stricter alignment, we can just check the alignment of __addr
in the new use_proxy_wait function in src/c++20/atomic.cc. That
function doesn't currently take the original __addr pointer, because
it doesn't currently need it, but it's not a public function so we can
just change it any time we need to. I can change it to take the
__addr argument now, marked [[maybe_unused]], if that would make it
clearer that the decision can (in theory) depend on the address.
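
i.e. something like this (a sketch of that possible future signature, not
part of this patch):

  [[gnu::always_inline]]
  inline bool
  use_proxy_wait([[maybe_unused]] const void* addr, const __wait_args_base& args)
  {
    // A future primitive with stricter requirements could check here, e.g.
    // if (reinterpret_cast<std::uintptr_t>(addr) % 16 != 0)
    //   return true; // not aligned enough, fall back to a proxy wait
    if constexpr (__platform_wait_uses_type<uint32_t>)
      if (args._M_obj_size == sizeof(uint32_t))
        return false;
    return true; // use a proxy for everything else
  }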

But I don't think the __waitable concept should consider alignment.

The objects we do atomic wait/notify on are already aligned suitably
for atomic ops, and the address of any individual object can be used
to decide whether it's suitably aligned for wait/notify; it doesn't
need to be a static property of the type.

The point of this __waitable check is "does this type look like
something that we should consider using as a futex", which for linux
today means "can it safely be reinterpret_cast to int& for the
kernel's purposes" (where I'm relying on the fact that the kernel can
happily ignore type-based alias analysis and just cares about the
bits!)

We should probably not treat struct { char c; /*padding;*/ short i; }
as a futex, but we can treat an enum as a futex, because as far as the
kernel's futex syscall is concerned, it's just some bits and it either
equals some value or it doesn't.
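
For example (illustrative only):

  enum class colour : int { red, green };
  std::atomic<colour> c{colour::red};
  // To the futex syscall, &c is indistinguishable from an int*: a 4-byte,
  // 4-byte-aligned object whose bits either equal the expected value or
  // don't. The struct above, with its interior padding byte, is not.
  // c.wait(colour::red); // could safely take the non-proxy path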

But I think we can later safely extend or change what is waitable (except
extending it past 8 bytes), as once we start setting _M_obj_size to
non-zero, the impl may use a platform wait.

So, I will go with the safe option: integral, pointer or enum. This would

That means always using a proxy wait for float and double, but I think
that's OK.

also give us no padding guarantee, I assume?

__int20 is integral and has padding bits. But it's a very special case
and only present on one or two non-mainstream targets.

+    template<typename _Tp>
+      concept __waitable
+       = is_scalar_v<_Tp> && (sizeof(_Tp) <= sizeof(__UINT64_TYPE__));
+
+    // Storage for up to 64 bits of value, should be considered opaque bits.
+    using __wait_value_type = __UINT64_TYPE__;
+
+    // lightweight std::optional<__wait_value_type>
     struct __wait_result_type
     {
-      __platform_wait_t _M_val;
+      __wait_value_type _M_val;
       unsigned char _M_has_val : 1; // _M_val value was loaded before return.
       unsigned char _M_timeout : 1; // Waiting function ended with timeout.
       unsigned char _M_unused : 6;  // padding
@@ -143,8 +151,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
     {
       __wait_flags _M_flags;
       int _M_order = __ATOMIC_ACQUIRE;
-      __platform_wait_t _M_old = 0;
+      __wait_value_type _M_old{};
       void* _M_wait_state = nullptr;
+      const void* _M_obj = nullptr;  // The address of the object to wait on.
+      unsigned char _M_obj_size = 0; // The size of that object.

       // Test whether _M_flags & __flags is non-zero.
       bool
@@ -162,36 +172,48 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
        explicit
        __wait_args(const _Tp* __addr, bool __bare_wait = false) noexcept
        : __wait_args_base{ _S_flags_for(__addr, __bare_wait) }
-       { }
+       {
+         _M_obj = __addr; // Might be replaced by _M_setup_wait
+         if constexpr (__waitable<_Tp>)
+           // __wait_impl might be able to wait directly on __addr
+           // instead of using a proxy, depending on its size.
+           _M_obj_size = sizeof(_Tp);
+       }

       __wait_args(const __platform_wait_t* __addr, __platform_wait_t __old,
                  int __order, bool __bare_wait = false) noexcept
-      : __wait_args_base{ _S_flags_for(__addr, __bare_wait), __order, __old }
-      { }
+      : __wait_args(__addr, __bare_wait)
+      {
+       _M_order = __order;
+       _M_old = __old;
+      }

       __wait_args(const __wait_args&) noexcept = default;
       __wait_args& operator=(const __wait_args&) noexcept = default;

-      template<typename _ValFn,
-              typename _Tp = decay_t<decltype(std::declval<_ValFn&>()())>>
+      template<typename _Tp, typename _ValFn>
        _Tp
-       _M_setup_wait(const void* __addr, _ValFn __vfn,
+       _M_setup_wait(const _Tp* __addr, _ValFn __vfn,
                      __wait_result_type __res = {})
        {
+         static_assert(is_same_v<_Tp, decay_t<decltype(__vfn())>>);
+
          if constexpr (__platform_wait_uses_type<_Tp>)
            {
-             // If the wait is not proxied, the value we check when waiting
-             // is the value of the atomic variable itself.
+             // If we know for certain that this type can be waited on
+             // efficiently using something like a futex syscall,
+             // then we can avoid the overhead of _M_setup_wait_impl
+             // and just load the value and store it into _M_old.

-             if (__res._M_has_val) // The previous wait loaded a recent value.
+             if (__res._M_has_val) // A previous wait loaded a recent value.
                {
                  _M_old = __res._M_val;
-                 return __builtin_bit_cast(_Tp, __res._M_val);
+                 return _S_bit_cast<_Tp>(_M_old);

I am not sure if I understand which branch of _S_bit_cast this would use:
we have neither sizeof(_To) == sizeof(_From) (4 vs 8) nor
sizeof(_From) == sizeof(__UINT32_TYPE__).

Oops, yes, that's a bug in _S_bit_cast. It was supposed to also handle
the case where we only want some of the bits in __from. Apparently I
lost that branch of _S_bit_cast in some revision of the patch.

I would much prefer if this were done as:
return __builtin_bit_cast(_Tp, static_cast<__UINT32_TYPE__>(_M_old));

That assumes that only 32 bits of the values are used, but the header
doesn't know that.
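
For reference, restoring the lost branch might look like this (a sketch,
assuming only 32-bit widening and narrowing conversions are needed):

      template<typename _To, typename _From>
        [[__gnu__::__always_inline__]]
        static _To
        _S_bit_cast(_From __from)
        {
          if constexpr (sizeof(_To) == sizeof(_From))
            return __builtin_bit_cast(_To, __from);
          else if constexpr (sizeof(_To) == sizeof(__UINT32_TYPE__))
            // Narrowing 64 -> 32: only the low 32 bits hold the value.
            return __builtin_bit_cast(_To, static_cast<__UINT32_TYPE__>(__from));
          else if constexpr (sizeof(_From) == sizeof(__UINT32_TYPE__))
            // Widening 32 -> 64: the 32 value bits, zero-extended.
            return __builtin_bit_cast(__UINT32_TYPE__, __from);
          else
            __builtin_unreachable();
        }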


                }
              else // Load the value from __vfn
                {
-                 _Tp __val = __vfn();
-                 _M_old = __builtin_bit_cast(__platform_wait_t, __val);
+                 auto __val = __vfn();
+                 _M_old = _S_bit_cast<__wait_value_type>(__val);

And here:
     _M_old = __builtin_bit_cast(__UINT32_TYPE__, __val);
However, instead of __UINT32_TYPE__, we should use
make_unsigned<__platform_wait_t>.

No, again, this assumes that only sizeof(__platform_wait_t) bytes are
useful. That is the design mistake in the current code. It assumes
that we will never do non-proxy wait on something larger than today's
__platform_wait_t. For a future version of linux, we will not change
__platform_wait_t (it will still be int) but we might start doing
non-proxy wait on 64-bit values. In that case, all bits of _M_val and
_M_old would be used, and casting to uint32_t in the header would
truncate the values.


                  return __val;
                }
            }
@@ -199,16 +221,33 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
            {
              if (__res._M_has_val) // The previous wait loaded a recent value.
                _M_old = __res._M_val;
-             else // Load _M_ver from the proxy (must happen before __vfn()).
-               _M_load_proxy_wait_val(__addr);
+             else // Let the library decide how to setup the wait.
+               {
+                 // Set _M_obj to the address to be waited on (either __addr
+                 // or a proxy) and load its current value into _M_old.
+                 _M_setup_wait_impl(__addr);
+               }
              return __vfn();
            }
        }

     private:
-      // Populates _M_wait_state and _M_old from the proxy for __addr.
+      template<typename _To, typename _From>
+       [[__gnu__::__always_inline__]]
+       static _To
+       _S_bit_cast(_From __from)
+       {
+         if constexpr (sizeof(_To) == sizeof(_From))
+           return __builtin_bit_cast(_To, __from);
+         else if constexpr (sizeof(_From) == sizeof(__UINT32_TYPE__))
+           return __builtin_bit_cast(__UINT32_TYPE__, __from);
+         else
+           __builtin_unreachable();
+       }
+
+      // Populates _M_wait_state and _M_old appropriately for __addr.
       void
-      _M_load_proxy_wait_val(const void* __addr);
+      _M_setup_wait_impl(const void* __addr);

       template<typename _Tp>
        static constexpr __wait_flags
@@ -218,8 +257,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
          __wait_flags __res = __abi_version | __do_spin;
          if (!__bare_wait)
            __res |= __track_contention;
-         if constexpr (!__platform_wait_uses_type<_Tp>)
-           __res |= __proxy_wait;
+         // if constexpr (!__platform_wait_uses_type<_Tp>)
+         // __res |= __proxy_wait;

Just remove this.

Yes, I wanted to get the patch sent in stage 1 so there is some
clean-up still needed. The __proxy_wait enumerator should be removed
entirely.

          return __res;
        }
     };
@@ -255,7 +294,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
                          __detail::__platform_wait_t __old,
                          int __order, bool __bare_wait = false)
   {
-#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
+#ifdef _GLIBCXX_UNKNOWN_PLATFORM_WAIT
     __glibcxx_assert(false); // This function can't be used for proxy wait.
 #endif
     __detail::__wait_args __args{ __addr, __old, __order, __bare_wait };
diff --git a/libstdc++-v3/src/c++20/atomic.cc b/libstdc++-v3/src/c++20/atomic.cc
index e280045b619d..d5fbd10ed11f 100644
--- a/libstdc++-v3/src/c++20/atomic.cc
+++ b/libstdc++-v3/src/c++20/atomic.cc
@@ -38,14 +38,7 @@
 # include <syscall.h>
 # include <bits/functexcept.h>
 # include <sys/time.h>
-#endif
-
-#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT
-# ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
-// __waitable_state assumes that we consistently use the same implementation
-// (i.e. futex vs mutex+condvar) for timed and untimed waiting.
-#  error "This configuration is not currently supported"
-# endif
+# define _GLIBCXX_HAVE_PLATFORM_WAIT 1
 #endif

 #pragma GCC diagnostic ignored "-Wmissing-field-initializers"
@@ -107,7 +100,7 @@ namespace
     // Count of threads blocked waiting on this state.
     alignas(_S_align) __platform_wait_t _M_waiters = 0;

-#ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
+#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
     mutex _M_mtx;

     // This type meets the Cpp17BasicLockable requirements.
@@ -123,7 +116,7 @@ namespace
     // use this for waiting and notifying functions instead.
     alignas(_S_align) __platform_wait_t _M_ver = 0;

-#ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
+#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
     __condvar _M_cv;
 #endif

@@ -216,17 +209,19 @@ namespace
  __spin_impl(const __platform_wait_t* __addr, const __wait_args_base& __args)
   {
     __platform_wait_t __val{};
+    __wait_value_type wval;
     for (auto __i = 0; __i < __atomic_spin_count; ++__i)
       {
        __atomic_load(__addr, &__val, __args._M_order);
-       if (__val != __args._M_old)
-         return { ._M_val = __val, ._M_has_val = true, ._M_timeout = false };
+       wval = static_cast<__wait_value_type>(__val);
+       if (wval != __args._M_old)
+         return { ._M_val = wval, ._M_has_val = true, ._M_timeout = false };
        if (__i < __atomic_spin_count_relax)
          __thread_relax();
        else
          __thread_yield();
       }
-    return { ._M_val = __val, ._M_has_val = true, ._M_timeout = true };
+    return { ._M_val = wval, ._M_has_val = true, ._M_timeout = true };
   }

   inline __waitable_state*
@@ -237,32 +232,58 @@ namespace
     return static_cast<__waitable_state*>(args._M_wait_state);
   }

+  [[gnu::always_inline]]
+  inline bool
+  use_proxy_wait(const __wait_args_base& args)

This needs to take the address that we are waiting on and check if it's
sufficiently aligned to __platform_wait_alignment.

Maybe in future yes, but today the only platform where use_proxy_wait
returns false is Linux, and we know that all 4-byte objects in
std::atomic or std::atomic_ref are also 4-byte aligned (or should be,
once PR 122410 gets fixed). So I don't agree that it *needs* to take
the address. It doesn't need to today, because it doesn't use it, and
if it did a runtime check that the pointer is aligned to at least
4-bytes, that would be wasted cycles today because it's already
guaranteed by std::atomic and std::atomic_ref.

But I can add the address argument, marked [[maybe_unused]] for now.

However, I do not think we should check uint64_t at all, as we do not have
such a platform wait yet, and we will not change __platform_wait_uses_type
in that case.

Yes, the 64-bit check could be removed and added back when macOS
support gets added here (I am already working on that patch as a proof
of concept for Iain to test, but it might not be ready today).


+  {
+    if constexpr (__platform_wait_uses_type<uint32_t>)
+      if (args._M_obj_size == sizeof(uint32_t))
+       return false;
+
+    if constexpr (__platform_wait_uses_type<uint64_t>)
+      if (args._M_obj_size == sizeof(uint64_t))
+       return false;
+
+    // Use proxy wait for everything else:
+    return true;
+  }
+
 } // namespace

-// Called for a proxy wait
+// Set _M_wait_state if using proxy wait, or caller wants contention tracking.
+// Set _M_obj to &_M_wait_state->_M_ver if using proxy wait.
+// Load the current value from _M_obj and store in _M_val.
 void
-__wait_args::_M_load_proxy_wait_val(const void* addr)
+__wait_args::_M_setup_wait_impl(const void* addr)

I would use _M_obj_size to indicate if a proxy wait is used for the actual
wait.

As you noted in the follow-up mail, that doesn't work, because
_M_setup_wait_impl is never called in the thread that calls
__notify_impl.

I did consider adding a _M_setup_notify_impl function, which would
adjust _M_obj and could set _M_obj_size=0 for proxy waits. But there
would be no benefit. Unlike std::atomic::wait, there is no loop in
std::atomic::notify, it just calls __notify_impl and returns. So if we
added a separate _M_setup_notify step that needs to be run before
__notify_impl then we're just doing two PLT calls into libstdc++.so
instead of one PLT call. That would just make things slower.

So __notify_impl needs to repeat the "does this combination of __addr
and _M_obj_size use a proxy wait" check directly, and that's what the
use_proxy_wait function does. There's no point modifying the
__wait_args to "remember" that decision for subsequent calls to
__notify_impl, because there will never be another call to
__notify_impl using the same __wait_args. The next call will use a new
__wait_args object, constructed by a different caller.

 {
-  // __glibcxx_assert( *this & __wait_flags::__proxy_wait );
+  if (!use_proxy_wait(*this))

Do: _M_obj_size > 0 && !use_proxy_wait(*this)

_M_obj_size is available in use_proxy_wait, so I would prefer to check
it in use_proxy_wait.

+    {
+      // We can wait on this address directly.
+      __glibcxx_assert(_M_obj == addr);

This function is the one that is expected to set _M_obj, so it seems
incorrect to assert it here.

As noted in your follow-up mail, that's not true. It's set in the
__wait_args ctor, and only set here for a proxy wait. This assertion
is ensuring that it really was set in the ctor.


-  // We always need a waitable state for proxy waits.
+      int val;
+      __atomic_load(static_cast<const int*>(addr), &val, __ATOMIC_ACQUIRE);
+      _M_old = val;
+
+      return;
+    }
+
+  // This will be a proxy wait, so get a waitable state.
   auto state = set_wait_state(addr, *this);

+  // The address we will wait on is the version count of the waitable state:
+  _M_obj = &state->_M_ver;

+
   // Read the value of the _M_ver counter.
-  __atomic_load(&state->_M_ver, &_M_old, __ATOMIC_ACQUIRE);
+  __platform_wait_t __old;
+  __atomic_load(&state->_M_ver, &__old, __ATOMIC_ACQUIRE);

Nitpick, but I think this could be just:
   __platform_wait_t __old = __atomic_load_n(&state->_M_ver, __ATOMIC_ACQUIRE);

Yes, I think I got confused by the fact that Clang doesn't support
__atomic_load_4 and decided that it doesn't support __atomic_load_n
either.


+  _M_old = __old;

Or just:
 _M_old = __atomic_load_n(&state->_M_ver, __ATOMIC_ACQUIRE);

Yes, I'll do that.

 }

 __wait_result_type
-__wait_impl(const void* __addr, __wait_args_base& __args)
+__wait_impl([[maybe_unused]] const void* __addr, __wait_args_base& __args)
 {
-  auto __state = static_cast<__waitable_state*>(__args._M_wait_state);
-
-  const __platform_wait_t* __wait_addr;
-
-  if (__args & __wait_flags::__proxy_wait)
-    __wait_addr = &__state->_M_ver;
-  else
-    __wait_addr = static_cast<const __platform_wait_t*>(__addr);
+  auto* __wait_addr = static_cast<const __platform_wait_t*>(__args._M_obj);

   if (__args & __wait_flags::__do_spin)
     {
@@ -286,6 +307,7 @@ __wait_impl(const void* __addr, __wait_args_base& __args)
   __atomic_load(__wait_addr, &__val, __args._M_order);
   if (__val == __args._M_old)
     {
+      auto __state = static_cast<__waitable_state*>(__args._M_wait_state);
       __state->_M_cv.wait(__state->_M_mtx);
       return { ._M_val = __val, ._M_has_val = false, ._M_timeout = false };
     }
@@ -294,24 +316,40 @@ __wait_impl(const void* __addr, __wait_args_base& __args)
 }

 void
-__notify_impl(const void* __addr, [[maybe_unused]] bool __all,
+__notify_impl([[maybe_unused]] const void* __addr, [[maybe_unused]] bool __all,
              const __wait_args_base& __args)
 {
-  auto __state = static_cast<__waitable_state*>(__args._M_wait_state);
-  if (!__state)
-    __state = &__waitable_state::_S_state_for(__addr);
+  const bool __track_contention = __args & __wait_flags::__track_contention;
+  const bool proxy_wait = use_proxy_wait(__args);

And here we could just check if _M_obj_size == 0, as this would be an
indication of a proxy wait.


-  [[maybe_unused]] const __platform_wait_t* __wait_addr;
+  [[maybe_unused]] auto* __wait_addr
+    = static_cast<const __platform_wait_t*>(__addr);
+
+#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT
+  // Check whether it would be a non-proxy wait for this object.
+  // This condition must match the one in _M_setup_wait_impl to ensure that
+  // the address used for the notify matches the one used for the wait.
+  if (!proxy_wait)
+    {
+      if (__track_contention)
+       if (!__waitable_state::_S_state_for(__addr)._M_waiting())
+         return;
+
+      __platform_notify(__wait_addr, __all);
+      return;
+    }
+#endif
+
+  // Either a proxy wait or we don't have platform wait/wake primitives.
+
+  auto __state = &__waitable_state::_S_state_for(__addr);

   // Lock mutex so that proxied waiters cannot race with incrementing _M_ver
   // and see the old value, then sleep after the increment and notify_all().
   lock_guard __l{ *__state };

-  if (__args & __wait_flags::__proxy_wait)
+  if (proxy_wait)
     {
-      // Waiting for *__addr is actually done on the proxy's _M_ver.
-      __wait_addr = &__state->_M_ver;
-
       // Increment _M_ver so that waiting threads see something changed.
       // This has to be atomic because the load in _M_load_proxy_wait_val
       // is done without the mutex locked.
@@ -322,11 +360,11 @@ __notify_impl(const void* __addr, [[maybe_unused]] bool __all,
       // they can re-evaluate their conditions to see if they should
       // stop waiting or should wait again.
       __all = true;
-    }
-  else // Use the atomic variable's own address.
-    __wait_addr = static_cast<const __platform_wait_t*>(__addr);

-  if (__args & __wait_flags::__track_contention)
+      __wait_addr = &__state->_M_ver;
+    }
+
+  if (__track_contention)
     {
       if (!__state->_M_waiting())
        return;
@@ -366,7 +404,7 @@ __platform_wait_until(const __platform_wait_t* __addr,
 }
 #endif // HAVE_LINUX_FUTEX

-#ifndef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
+#ifndef _GLIBCXX_HAVE_PLATFORM_WAIT
 bool
 __cond_wait_until(__condvar& __cv, mutex& __mx,
                  const __wait_clock_t::time_point& __atime)
@@ -381,7 +419,7 @@ __cond_wait_until(__condvar& __cv, mutex& __mx,
     __cv.wait_until(__mx, __ts);
   return __wait_clock_t::now() < __atime;
 }
-#endif // ! HAVE_PLATFORM_TIMED_WAIT
+#endif // ! HAVE_PLATFORM_WAIT

 // Unlike __spin_impl, does not always return _M_has_val == true.
 // If the deadline has already passed then no fresh value is loaded.
@@ -414,7 +452,9 @@ __spin_until_impl(const __platform_wait_t* __addr,
            return __res;
        }

-      __atomic_load(__addr, &__res._M_val, __args._M_order);
+      __platform_wait_t val;
+      __atomic_load(__addr, &val, __args._M_order);

Same comment on __atomic_load_n here.

Agreed.

+      __res._M_val = val;
       __res._M_has_val = true;
       if (__res._M_val != __args._M_old)
        {
@@ -428,16 +468,11 @@ __spin_until_impl(const __platform_wait_t* __addr,
 } // namespace

 __wait_result_type
-__wait_until_impl(const void* __addr, __wait_args_base& __args,
+__wait_until_impl([[maybe_unused]] const void* __addr, __wait_args_base& __args,
                  const __wait_clock_t::duration& __time)
 {
   const __wait_clock_t::time_point __atime(__time);
-  auto __state = static_cast<__waitable_state*>(__args._M_wait_state);
-  const __platform_wait_t* __wait_addr;
-  if (__args & __wait_flags::__proxy_wait)
-    __wait_addr = &__state->_M_ver;
-  else
-    __wait_addr = static_cast<const __platform_wait_t*>(__addr);
+  auto* __wait_addr = static_cast<const __platform_wait_t*>(__args._M_obj);

   if (__args & __wait_flags::__do_spin)
     {
@@ -448,9 +483,9 @@ __wait_until_impl(const void* __addr, __wait_args_base& __args,
        return __res;
     }

-#ifdef _GLIBCXX_HAVE_PLATFORM_TIMED_WAIT
+#ifdef _GLIBCXX_HAVE_PLATFORM_WAIT
   if (__args & __wait_flags::__track_contention)
-    set_wait_state(__addr, __args);
+    set_wait_state(__addr, __args); // scoped_wait needs a __waitable_state
   scoped_wait s(__args);
   bool timeout = !__platform_wait_until(__wait_addr, __args._M_old, __atime);
   return { ._M_val = __args._M_old, ._M_has_val = false, ._M_timeout = timeout };
@@ -460,6 +495,7 @@ __wait_until_impl(const void* __addr, __wait_args_base& __args,
   __atomic_load(__wait_addr, &__val, __args._M_order);
   if (__val == __args._M_old)
     {
+      auto __state = static_cast<__waitable_state*>(__args._M_wait_state);
       bool timeout = !__cond_wait_until(__state->_M_cv, __state->_M_mtx, __atime);
       return { ._M_val = __val, ._M_has_val = false, ._M_timeout = timeout };
     }
--
2.51.1


