On Fri, Nov 28, 2025 at 12:33 PM Jonathan Wakely <[email protected]> wrote:
> On Fri, 28 Nov 2025 at 08:53, Tomasz Kaminski <[email protected]> wrote:
> >
> > On Thu, Nov 27, 2025 at 5:38 PM Jonathan Wakely <[email protected]> wrote:
> >>
> >> On Thu, 27 Nov 2025 at 16:01, Tomasz Kaminski <[email protected]> wrote:
> >> >
> >> > On Thu, Nov 27, 2025 at 4:35 PM Jonathan Wakely <[email protected]> wrote:
> >> >>
> >> >> The __spin_until_impl function was presumably intended to just spin for
> >> >> a short time, then give up and let the caller wait on a futex or
> >> >> condvar. However, __spin_until_impl will never stop spinning unless
> >> >> either the value changes or the timeout is reached. This means that when
> >> >> __spin_until_impl returns, the caller should immediately return (because
> >> >> either the value we were waiting for has changed, or the timeout
> >> >> happened). So __wait_until_impl should never block on a futex or
> >> >> condvar. However, the check for the return value of __spin_until_impl
> >> >> would only return if the value changed (i.e. !__res._M_timeout). So if
> >> >> the timeout occurred, it would fall through and block on the
> >> >> futex/condvar, even though the timeout had already been reached.
> >> >
> >> > Yes, it was calling sleep for at most 64ms per iteration, and started
> >> > spinning close to the timeout. Which is strange, because I would assume
> >> > a spin loop is beneficial when we have nothing to wait on.
> >> >>
> >> >> This was causing a major performance regression in the timed waiting
> >> >> functions of std::counting_semaphore.
> >> >>
> >> >> The simplest fix is to replace __spin_until_impl entirely, just calling
> >> >> __spin_impl to spin a small, finite number of times, and then return
> >> >> immediately if either the value changed or the timeout happened. This
> >> >> ensures that we don't block on the futex/condvar unnecessarily.
> >> >>
> >> >> Removing __spin_until_impl also has the advantage that we no longer keep
> >> >> calling steady_clock::now() on every iteration to check for a timeout.
> >> >> That was also adding significant overhead to the timed waiting
> >> >> functions.
> >> >
> >> > LGTM.
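To restate the bug in code, as I read it (my paraphrase of the control flow;
the real code is in the hunk below): the old path in __wait_until_impl was
effectively

    auto __res = __detail::__spin_until_impl(__wait_addr, __args, __atime);
    if (!__res._M_timeout)
      return __res;  // the value changed while spinning
    // __spin_until_impl only returns on a value change or on timeout, so
    // reaching this point means the deadline has already passed, yet we
    // still fell through and blocked on the futex/condvar below.

so a wait that timed out during the spin phase went to sleep again anyway.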
> >> >>
> >> >> libstdc++-v3/ChangeLog:
> >> >>
> >> >>         PR libstdc++/122878
> >> >>         * src/c++20/atomic.cc (__spin_until_impl): Remove.
> >> >>         (__wait_until_impl): Use __spin_impl instead of
> >> >>         __spin_until_impl and return if timeout is reached after
> >> >>         spinning.
> >> >> ---
> >> >>
> >> >> Tested x86_64-linux.
> >> >>
> >> >>  libstdc++-v3/src/c++20/atomic.cc | 47 ++------------------------------
> >> >>  1 file changed, 3 insertions(+), 44 deletions(-)
> >> >>
> >> >> diff --git a/libstdc++-v3/src/c++20/atomic.cc b/libstdc++-v3/src/c++20/atomic.cc
> >> >> index fdd67d834768..16fd91b7d7ab 100644
> >> >> --- a/libstdc++-v3/src/c++20/atomic.cc
> >> >> +++ b/libstdc++-v3/src/c++20/atomic.cc
> >> >> @@ -455,49 +455,6 @@ __cond_wait_until(__condvar& __cv, mutex& __mx,
> >> >>      return __wait_clock_t::now() < __atime;
> >> >>    }
> >> >>  #endif // ! HAVE_PLATFORM_WAIT
> >> >> -
> >> >> -// Unlike __spin_impl, does not always return _M_has_val == true.
> >> >> -// If the deadline has already passed then no fresh value is loaded.
> >> >> -__wait_result_type
> >> >> -__spin_until_impl(const __platform_wait_t* __addr,
> >> >> -                 const __wait_args_base& __args,
> >> >> -                 const __wait_clock_t::time_point& __deadline)
> >> >> -{
> >> >> -  using namespace literals::chrono_literals;
> >> >> -
> >> >> -  __wait_result_type __res{};
> >> >> -  auto __t0 = __wait_clock_t::now();
> >> >> -  auto __now = __t0;
> >> >> -  for (; __now < __deadline; __now = __wait_clock_t::now())
> >> >> -    {
> >> >> -      auto __elapsed = __now - __t0;
> >> >> -#ifndef _GLIBCXX_NO_SLEEP
> >> >> -      if (__elapsed > 128ms)
> >> >> -       this_thread::sleep_for(64ms);
> >> >> -      else if (__elapsed > 64us)
> >> >> -       this_thread::sleep_for(__elapsed / 2);
> >> >> -      else
> >> >> -#endif
> >> >> -      if (__elapsed > 4us)
> >> >> -       __thread_yield();
> >> >> -      else
> >> >> -       {
> >> >> -         __res = __detail::__spin_impl(__addr, __args);
> >> >> -         if (!__res._M_timeout)
> >> >> -           return __res;
> >> >> -       }
> >> >> -
> >> >> -      __res._M_val = __atomic_load_n(__addr, __args._M_order);
> >> >> -      __res._M_has_val = true;
> >> >> -      if (__res._M_val != __args._M_old)
> >> >> -       {
> >> >> -         __res._M_timeout = false;
> >> >> -         return __res;
> >> >> -       }
> >> >> -    }
> >> >> -  __res._M_timeout = true;
> >> >> -  return __res;
> >> >> -}
> >> >> } // namespace
> >> >>
> >> >>  __wait_result_type
> >> >> @@ -509,11 +466,13 @@ __wait_until_impl([[maybe_unused]] const void* __addr, __wait_args_base& __args,
> >> >
> >> > A few lines before, we are passing a wait_clock_t::duration holding the
> >> > time-since-epoch value here; wouldn't it make more sense, and make the
> >> > implementation cleaner, if we passed a wait_clock_t::time_point instead?
> >> > We have different types for duration and time_point for a reason.
> >>
> >> I should add a comment about that. Passing a duration instead of a
> >> time_point means that the choice of clock isn't baked in to the ABI.
> >> We could use the same entry point into the library to pass a
> >> system_clock::time_point if we assigned a new bit in the __wait_flags
> >> structure to mean it uses the system_clock not the steady_clock. Or we
> >> could use a flag to say that it's a relative time (as a duration) not
> >> an absolute time as a time_point.
> >
> > We still bake the choice of clock period into the API.
>
> Yes, but choosing int64_t representing nanoseconds for a timeout
> doesn't seem like it's over-constraining anything, and unlikely to be
> a decision we regret.

I am not referring to the specialization of duration that is used, but to
the fact that the parameter is declared as wait_clock_t::duration, so if the
wait_clock_t clock is ever changed, the parameter type will automatically
change with it, and from what you said we do not want that. Having a
separate wait_epoch_time_t pointing directly to that specialization would
make the intent clear to me. (The signature would remain the same regardless
of such a change.) Roughly what I have in mind is sketched below.
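Something along these lines (just an illustrative sketch; the name and exact
signature are hypothetical, not a concrete patch):

    // A fixed nanoseconds-since-epoch type, independent of wait_clock_t,
    // matching the int64_t-nanoseconds representation you describe above.
    // Changing wait_clock_t later then cannot silently change this
    // signature.
    using __wait_epoch_time_t = chrono::duration<int64_t, nano>;

    __wait_result_type
    __wait_until_impl(const void* __addr, __wait_args_base& __args,
                      const __wait_epoch_time_t& __atime);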
> > I would suggest having a typedef wait_epoch_time_t instead of using
> > wait_clock_t::duration, one that would be independent of wait_clock_t.
> > It would also be a good place to add the comment you mentioned.
> >
> > Can be done in a separate patch.
> >>
> >> So you're correct that we're using the "wrong" type here, but it's a
> >> form of type erasure (erasing the clock parameter) that makes the API
> >> more flexible.
> >>
> >> >
> >> > One of the two call sites has a time_point and then turns it into a
> >> > duration:
> >> >     auto __res = __detail::__wait_until_impl(__addr, __args,
> >> >                                              __at.time_since_epoch());
> >> > and the function then constructs the time_point back. But this could
> >> > be done as a separate patch.
> >> >
> >> >>
> >> >>    if (__args & __wait_flags::__do_spin)
> >> >>      {
> >> >> -      auto __res = __detail::__spin_until_impl(__wait_addr, __args, __atime);
> >> >> +      auto __res = __detail::__spin_impl(__wait_addr, __args);
> >> >>        if (!__res._M_timeout)
> >> >>         return __res;
> >> >>        if (__args & __wait_flags::__spin_only)
> >> >>         return __res;
> >> >> +      if (__wait_clock_t::now() >= __atime)
> >> >> +       return __res;
> >> >>      }
> >> >>
> >> >>  #ifdef _GLIBCXX_HAVE_PLATFORM_WAIT
> >> >> --
> >> >> 2.51.1
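FWIW, the regression is easy to observe from user code with something like
the snippet below (illustrative only, not from the patch or its tests;
absolute numbers will vary by machine):

    #include <chrono>
    #include <cstdio>
    #include <semaphore>

    int main()
    {
      using namespace std::chrono;
      std::counting_semaphore<> sem(0); // never released: every wait times out
      auto t0 = steady_clock::now();
      for (int i = 0; i < 100; ++i)
        (void) sem.try_acquire_for(1ms); // should give up soon after 1ms
      auto t1 = steady_clock::now();
      // Expect a total close to 100ms with this patch; the old spinning
      // and sleeping code could overshoot each deadline by much more
      // because of its sleep_for calls.
      std::printf("%lld us\n",
                  (long long) duration_cast<microseconds>(t1 - t0).count());
    }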
