Hi there -- since you've not pushed quite yet can I ask that you use the following commit message instead of the existing one? (Updated a few comments to match what has changed since I wrote that message).

Apologies for not noticing it earlier.


----
libstdc++: Conditionally use floating-point fetch_add builtins

- Some hardware has support for floating point atomic fetch_add (and
  similar).
- There are existing compilers targetting this hardware that use
  libstdc++ -- e.g. NVC++.
- Since the libstdc++ atomic<float>::fetch_add and similar is written
  directly as a CAS loop these compilers can not emit optimal code when
  seeing such constructs.
- I hope to use __atomic_fetch_add builtins on floating point types
  directly in libstdc++ so these compilers can emit better code.
- Clang already handles some floating point types in the
  __atomic_fetch_add family of builtins.
- In order to only use this when available, I originally thought I could
  check against the resolved versions of the builtin in a manner
  something like `__has_builtin(__atomic_fetch_add_<fp-suffix>)`.
  I then realised that clang does not expose resolved versions of these
  atomic builtins to the user.
  From the clang discourse it was suggested we instead use SFINAE (which
  clang already supports).
- I have recently pushed a patch for allowing the use of SFINAE on
  builtins.
  https://gcc.gnu.org/pipermail/gcc-patches/2024-October/664999.html
  Now that patch is upstream, this patch does not change what happens
  for GCC, while it uses the builtin for codegen with clang.
- I have previously sent a patchset upstream adding the ability to use
  __atomic_fetch_add and similar on floating point types.
  https://gcc.gnu.org/pipermail/gcc-patches/2024-November/668754.html
  Once that patchset is upstream (plus the automatic linking of
  libatomic as Joseph pointed out in the email below
  https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665408.html )
  then current GCC should start to use the builtin branch added in this
  patch.

So *currently*, this patch allows external compilers (NVC++ in
particular) to generate better code, and similarly lets clang understand
the operation better since it maps to a known builtin.

I hope that by GCC 16 this patch would also allow GCC to understand the
operation better via mapping to a known builtin.

Testing done:
1) No change seen in bootstrap & regression test on AArch64
2) No change seen in bootstrap & regression test on x86_64 (Jonathan did
   this test).
3) Manually checked that when compiling with clang we follow the branch
   that uses the builtin for `float` (because clang has already
   implemented these builtins for `float`).
   - Done by adding `+1` to the result of the builtin branch and
     checking that we abort when running the result for each of
     fetch_add/fetch_sub/add_fetch/sub_fetch in turn (i.e. 4 separate
     manual checks).
4) Manually checked that when compiling with GCC we follow the branch
   that does not use the builtin for `float`.
   - Done this by adding the same temporary bug to the header in the
     builtin branch, and re-running tests to see that we still pass with
     GCC.

libstdc++-v3/ChangeLog:

        * include/bits/atomic_base.h (__atomic_fetch_addable): Define
        new concept.
        (__atomic_impl::__fetch_add_flt): Use new concept to make use of
        __atomic_fetch_add when available.
        (__atomic_fetch_subtractable, __fetch_sub_flt): Likewise.
        (__atomic_add_fetchable, __add_fetch_flt): Likewise.
        (__atomic_sub_fetchable, __sub_fetch_flt): Likewise.

Signed-off-by: Matthew Malcomson <mmalcom...@nvidia.com>
Co-authored-by: Jonathan Wakely <jwak...@redhat.com>

---


On 2/8/25 14:05, Jonathan Wakely wrote:
External email: Use caution opening links or attachments


On Sat, 8 Feb 2025 at 10:55, Matthew Malcomson <mmalcom...@nvidia.com> wrote:

Hi Jonathan!

Many thanks!  Will learn the libstdc++ style eventually.

I've run bootstrap & regression test on this, and did the manual checks
I mentioned before of compiling atomic_float/1.cc with clang and then
adding `+ 1` on the builtin codepath to check that the clang binary
aborts while the GCC built binary passes.

Great, thanks for checking it. I'll get this pushed early next week.



Thanks again!
Matthew

On 2/7/25 15:33, Jonathan Wakely wrote:
External email: Use caution opening links or attachments


On 05/02/25 13:43 +0000, Jonathan Wakely wrote:
On 28/10/24 17:15 +0000, mmalcom...@nvidia.com wrote:
From: Matthew Malcomson <mmalcom...@nvidia.com>

I noticed that the libstdc++ patch is essentially separate and figured I
could send it upstream earlier to give reviewers more time to look at
it.
I am still working on adding the ability to use floating point types in
the __atomic_fetch_add builtins

Review of current state and motivation (for anyone reading this that has
not already seen the previous patches):
- Some hardware has support for floating point atomic fetch_add (and
similar).
- There are existing compilers targetting this hardware that use
libstdc++ -- e.g. NVC++.
- Since the libstdc++ atomic<float>::fetch_add and similar is written
directly as a CAS loop these compilers can not emit optimal code when
seeing such constructs.
- I hope to use __atomic_fetch_add builtins on floating point types
directly in libstdc++ so these compilers can emit better code.
- Clang already handles some floating point types in the
__atomic_fetch_add family of builtins.
- In order to only use this when available, I originally thought I could
check against the resolved versions of the builtin in a manner
something like `__has_builtin(__atomic_fetch_add_<fp-suffix>)`.
I then realised that clang does not expose resolved versions of these
atomic builtins to the user.
 From the clang discourse it was suggested we instead use SFINAE (which
clang already supports).
- I have posted a patch for allowing the use of SFINAE on builtins (not
yet reviewed).
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/664999.html
*This* patch builds and regtests on top of that patch.  It does not
change what happens for GCC, while it uses the builtin for codegen with
clang.
- I have previously sent a patchset upstream adding the ability to use
__atomic_fetch_add and similar on floating point types.
https://gcc.gnu.org/pipermail/gcc-patches/2024-September/663355.html
I hope to send a full patchset up soon including the suggestions given
there.
With that patchset included (plus the automatic linking of libatomic
as Joseph pointed out in the email below
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665408.html )
then current GCC should start to use the builtin branch added in this
patch.

So *currently*, this patch allows external compilers (NVC++ in
particular) to generate better code, and similarly lets clang understand
the operation better since it maps to a known builtin.

I hope that by GCC 15 this patch would also allow GCC to understand the
operation better via mapping to a known builtin.

----------------- 8< ----------- >8 ----------------

Points to question here are:
1) Is this the best approach for using SFINAE to check if this builtin
  has a particular overload?
  Don't know of a better approach, but not an expert in C++ templating.

Concepts!

We still need the CAS loop fallback for any compiler that doesn't
implement this builtin.  Once all compilers we care about implement this
we can remove this special handling and merge the floating point and
integral operations into the same template.

Testing done:
N.b. all testing done on top of the patch introducing SFINAE on builtins
here, all testing done on AArch64:
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/664999.html

1) No change seene in bootstrap & regression test on AArch64
2) Manually ran the atomic_float/1.cc testcase with clang++ 19 and
  things passed.  With clang++ 18 there was a failure independent to
  this change where the use of `is_lock_free` in the testcase was not
  optimised away so we got a linker error.  After editing the testcase
  it also passes with clang++ 18.
3) Manually checked that when compiling with clang we follow the branch
  that uses the builtin for `float` (because clang has already
  implemented these builtins for `float`).
  - Done by adding `+1` to the result of that branch and checking that
    we abort when running the result.
4) Manually checked that when compiling with GCC we follow the branch
  that does not use the builtin for `float`.
  - Done this by adding the same temporary bug to the header in the
    builtin branch, and re-running tests to see that we still pass with
    GCC.

Ok for trunk?

libstdc++-v3/ChangeLog:

      * include/bits/atomic_base.h
         (__atomic_impl::__is_atomic_fetch_add_available): Define new
         struct using SFINAE to check whether __atomic_fetch_add is
         implemented on floating point type.
         (std::__atomic_impl::__fetch_add_flt): `if constexpr` branch
         on the above SFINAE test to use __atomic_fetch_add when
         available.
         (__atomic_impl::__is_atomic_add_fetch_available,
         std::__atomic_impl::__add_fetch_flt): Likewise.
         (__atomic_impl::__is_atomic_fetch_sub_available,
         std::__atomic_impl::__fetch_sub_flt): Likewise.
         (__atomic_impl::__is_atomic_sub_fetch_available,
         std::__atomic_impl::__sub_fetch_flt): Likewise.

Signed-off-by: Matthew Malcomson <mmalcom...@nvidia.com>
---
libstdc++-v3/include/bits/atomic_base.h | 116 ++++++++++++++++++------
1 file changed, 90 insertions(+), 26 deletions(-)

diff --git a/libstdc++-v3/include/bits/atomic_base.h b/libstdc++-v3/
include/bits/atomic_base.h
index 72cc4bae6cf..d7671b0e9d2 100644
--- a/libstdc++-v3/include/bits/atomic_base.h
+++ b/libstdc++-v3/include/bits/atomic_base.h
@@ -1209,54 +1209,118 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
      __xor_fetch(_Tp* __ptr, _Val<_Tp> __i) noexcept
      { return __atomic_xor_fetch(__ptr, __i, __ATOMIC_SEQ_CST); }

+      template <typename _Tp, typename = void>
+      struct __is_atomic_fetch_add_available : false_type
+      {
+      };
+      template <typename T>

This T isn't uglified as _Tp, but it doesn't matter because ...

+      struct __is_atomic_fetch_add_available<
+     T, std::void_t<decltype (__atomic_fetch_add (
+          std::declval<T *> (), std::declval<T> (), int ()))>> :
true_type
+      {
+      };
+

Atomics for floating-point types are only in C++20, so we can use
Concepts for these constraints.

template<typename _Tp>
  concept __atomic_fetch_addable
    = requires (_Tp __t) { __atomic_fetch_add(&__t, __t, 1); };

I assume this will work fine if the void_t version worked.

    template<typename _Tp>
      _Tp
      __fetch_add_flt(_Tp* __ptr, _Val<_Tp> __i, memory_order __m)
noexcept
      {
-     _Val<_Tp> __oldval = load(__ptr, memory_order_relaxed);
-     _Val<_Tp> __newval = __oldval + __i;
-     while (!compare_exchange_weak(__ptr, __oldval, __newval, __m,
-                                   memory_order_relaxed))
-       __newval = __oldval + __i;
-     return __oldval;
+     if constexpr (__is_atomic_fetch_add_available<_Tp>::value)

Then this would be just:

       if constexpr (__atomic_fetch_addable<_Tp>)

i.e. no ::value

+       return __atomic_fetch_add (__ptr, __i, int (__m));

Also, we don't put a space before the opening paren in libstdc++ code.
That weird GNU-ism isn't used here :-)

I've attached a revised patch with my suggested changes. I've tested
this with GCC trunk on x86_64-linux. Could you re-run your tests with
this please?





Reply via email to