On 9/5/25 10:41, Jakub Jelinek wrote:

On Fri, Sep 05, 2025 at 10:30:49AM +0100, Matthew Malcomson wrote:
Ok -- TBH I don't have any extra details on this argument right now and your
point on its feasibility seems quite convincing. (Timezones are slowing
communication with the other compiler team about why it's hard in their case --
I may get more information later).

The other arguments towards a builtin I know of are less of a requirement
and more about helping keep code standardised throughout the ecosystem:

- The design of the atomic builtins was to match the requirements for C++11.
It would seem natural to me to keep matching the C++ standard as it evolves.
(In this case also providing users writing C with a standard interface to
use this functionality).

- Specifically for the fetch_min/fetch_max, the paper that proposed it be
standardised discussed the forms of CAS loop that might be written and how
the semantics of two of them are subtly different (section 5 of
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p0493r5.pdf). If we
don't provide a user interface then each user writing C would have to deal
with the subtleties of atomic synchronisation vs aggressive optimisations
themselves, increasing the possibility of mistakes being made.

Without a hw instruction (which is the case on most targets, guess except
the LL/SC ones which implement everything as kind of a CAS loop but not
exactly), it will need to be implemented as a CAS loop anyway.
All builtins have a cost and having to add min + max times signed vs.
unsigned vs. maybe floating point (in that case it needs to be specific
which one, e.g. 2 byte can be both _Float16 and bf16 and perhaps others,
16 byte can be IEEE quad vs. IBM double double vs. Intel extended etc.)
times the 5 different sizes is a lot of builtins.  And either you just
write it once using a CAS loop in the headers (and I really believe it is
hard to obfuscate it too much after optimizations, worst case there will
be some casts, I wouldn't bother trying to optimize loops which do an atomic_load
in each iteration instead of using the new memory value from CAS) and
pattern match, or need to write it twice in the headers (if builtin is
supported vs. if it is not; this version is user unfriendly) or always
use builtin (but that is backwards incompatible, doesn't support older
compilers) and let the compiler lower them to CAS loops if there is no
optab.

         Jakub
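
For readers following along, the "write it once using a CAS loop in the headers" option could look roughly like the sketch below. This is only an illustration and not code from libstdc++ or from any patch in this thread; the name fetch_max_cas is made up here. It reuses the value that a failed compare-exchange writes back rather than doing a fresh atomic_load per iteration. Note that this particular form skips the store when the stored value is already large enough, which may be one of the places where different hand-written loops end up with subtly different semantics.

```cpp
#include <atomic>

// Hypothetical helper (name invented for this sketch): an atomic
// "fetch max" written as a compare-exchange loop on std::atomic<T>.
template <typename T>
T fetch_max_cas(std::atomic<T>& obj, T value,
                std::memory_order order = std::memory_order_seq_cst) {
    // Start from a relaxed read; the CAS below carries the requested ordering.
    T old = obj.load(std::memory_order_relaxed);
    // Only attempt a store while the stored value is smaller than `value`.
    // On failure compare_exchange_weak reloads `old` with the current
    // contents, so no separate load is needed inside the loop.
    while (old < value &&
           !obj.compare_exchange_weak(old, value, order,
                                      std::memory_order_relaxed)) {
    }
    // Like the other fetch_* operations, return the previously stored value.
    return old;
}
```

With std::atomic<int> a{3}, calling fetch_max_cas(a, 7) returns 3 and leaves 7 stored; a following fetch_max_cas(a, 5) returns 7 and changes nothing.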


To bring this to Joseph's attention: I think the arguments Jakub is making also apply to the floating point fetch_add/fetch_sub series I put up last year and recently respun (which Joseph has been reviewing): https://gcc.gnu.org/pipermail/gcc-patches/2025-August/692329.html

-----
Our main aim is to have the new libstdc++ methods implemented with a builtin for other compilers (I'm getting a clear indication that pattern-matching is fragile for our other compiler teams).

For GCC 15 we had a similar goal for the floating point fetch_add/fetch_sub methods. There we ended up with two implementations in libstdc++, with the choice between them made via SFINAE (proposed in https://gcc.gnu.org/pipermail/libstdc++/2025-February/060377.html -- and Jonathan fixed the patch for us at https://gcc.gnu.org/pipermail/libstdc++/2025-February/060384.html).

At the time we added that SFINAE decision we assumed that the patchset adding floating point variants for fetch_add/fetch_sub was going to go in for GCC 16.

Question for Jonathan: would it still be OK to have the SFINAE choice in libstdc++, justified by clang having the builtins, even if GCC has no plan to add them? (Clang currently has the generic atomic fetch_min/fetch_max builtins, though notably it does not expose the resolved/sized variants to the user.)
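
To make the question concrete, the kind of SFINAE choice meant here might look something like the sketch below. It is not the actual libstdc++ patch: the names has_fetch_max_builtin and fetch_max_dispatch are invented for illustration, and it assumes the generic builtin is spelled __atomic_fetch_max(ptr, val, memorder). Where that call is not well-formed, the code falls back to a CAS loop written with the existing __atomic_load_n and __atomic_compare_exchange_n builtins.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical detection trait: true only if the expression
// __atomic_fetch_max(T*, T, int) is well-formed for this compiler.
template <typename T, typename = void>
struct has_fetch_max_builtin : std::false_type {};

template <typename T>
struct has_fetch_max_builtin<
    T, std::void_t<decltype(__atomic_fetch_max(
           std::declval<T*>(), std::declval<T>(), __ATOMIC_SEQ_CST))>>
    : std::true_type {};

// Hypothetical dispatcher: one entry point, two implementations.
template <typename T>
T fetch_max_dispatch(T* ptr, T value) {
    if constexpr (has_fetch_max_builtin<T>::value) {
        // Builtin path (assumed spelling, see above).
        return __atomic_fetch_max(ptr, value, __ATOMIC_SEQ_CST);
    } else {
        // Fallback path: CAS loop. A failed compare-exchange writes the
        // current contents back into `old`, so the loop does not reload.
        T old = __atomic_load_n(ptr, __ATOMIC_RELAXED);
        while (old < value &&
               !__atomic_compare_exchange_n(ptr, &old, value, /*weak=*/true,
                                            __ATOMIC_SEQ_CST,
                                            __ATOMIC_RELAXED)) {
        }
        return old;
    }
}
```

Either path returns the previously stored value, so callers see the same interface whichever one the compiler selects; the cost is carrying both implementations in the header.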

N.b. for context, we have the same goal for the atomic reductions in C++26 (I mention this so that the size of the code changes in libstdc++ is apparent from the outset).

MM
