On 9/5/25 10:41, Jakub Jelinek wrote:
On Fri, Sep 05, 2025 at 10:30:49AM +0100, Matthew Malcomson wrote:
Ok -- TBH I don't have any extra details on this argument right now and your
point on its feasibility seems quite convincing. (Timezones are slowing
communication with the other compiler team about why it's hard in their case --
I may get more information later).
The other arguments towards a builtin I know of are less of a requirement
and more about helping keep code standardised throughout the ecosystem:
- The design of the atomic builtins was to match the requirements for C++11.
It would seem natural to me to keep matching the C++ standard as it evolves.
(In this case also providing users writing C with a standard interface to
use this functionality).
- Specifically for fetch_min/fetch_max, the paper that proposed they be
standardised discussed the forms of CAS loop that might be written and how
the semantics of two of them are subtly different (section 5 of
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p0493r5.pdf). If we
don't provide a user interface then each user writing C would have to deal
with the subtleties of atomic synchronisation vs aggressive optimisations
themselves, increasing the possibility of mistakes being made.
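
To illustrate the subtlety mentioned above, here is a minimal sketch of two
CAS-loop shapes a C user might write for a signed int fetch_max. The function
names are hypothetical and seq_cst ordering is assumed throughout. As I
understand the paper, the observable difference is whether a write still
happens when the stored value is already the maximum, which affects what
other threads can synchronise with.

```c
#include <stdbool.h>

/* Form A: stops as soon as the stored value is already >= val, so in that
   case the operation is only a load and performs no write.  */
static inline int
sketch_fetch_max_skip_write (int *ptr, int val)
{
  int old = __atomic_load_n (ptr, __ATOMIC_SEQ_CST);
  /* On CAS failure `old` is refreshed with the current stored value.  */
  while (old < val
         && !__atomic_compare_exchange_n (ptr, &old, val, true,
                                          __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
    ;
  return old;
}

/* Form B: always completes a successful CAS, so it is a read-modify-write
   even when the stored value does not change.  */
static inline int
sketch_fetch_max_always_write (int *ptr, int val)
{
  int old = __atomic_load_n (ptr, __ATOMIC_SEQ_CST);
  int desired;
  do
    desired = old < val ? val : old;
  while (!__atomic_compare_exchange_n (ptr, &old, desired, true,
                                       __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));
  return old;
}
```

Both forms return the previous value and leave the same final value in
memory; they differ only in whether the "no change" case writes.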
Without a hw instruction (which is the case on most targets, I guess, except
the LL/SC ones which implement everything as kind of a CAS loop but not
exactly), it will need to be implemented as a CAS loop anyway.
All builtins have a cost, and having to add min + max times signed vs.
unsigned vs. maybe floating point (in that case it needs to be specific
which one, e.g. 2 byte can be both _Float16 and bf16 and perhaps others,
16 byte can be IEEE quad vs. IBM double double vs. Intel extended etc.)
times the 5 different sizes is a lot of builtins. Either you just
write it once using a CAS loop in the headers and pattern match (I really
believe it is hard to obfuscate it too much after optimizations; worst case
there will be some casts, and I wouldn't bother trying to optimize loops
which do an atomic_load in each iteration instead of using the new memory
value from CAS), or you need to write it twice in the headers (if the
builtin is supported vs. if it is not; this version is user unfriendly), or
you always use the builtin (but that is backwards incompatible, as it
doesn't support older compilers) and let the compiler lower it to CAS loops
if there is no optab.
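
For concreteness, the "write it twice in the headers" option could look
roughly like the sketch below. The wrapper name sketch_atomic_fetch_max_int
is hypothetical, seq_cst ordering is assumed, and whether a given compiler
accepts __atomic_fetch_max is exactly what the __has_builtin check probes.

```c
#include <stdbool.h>

/* Compilers without __has_builtin are treated as lacking every builtin.  */
#ifndef __has_builtin
# define __has_builtin(x) 0
#endif

static inline int
sketch_atomic_fetch_max_int (int *ptr, int val)
{
#if __has_builtin (__atomic_fetch_max)
  /* Version 1: the compiler provides the builtin.  */
  return __atomic_fetch_max (ptr, val, __ATOMIC_SEQ_CST);
#else
  /* Version 2: fallback CAS loop for compilers without the builtin.
     On CAS failure `old` is refreshed with the current stored value.  */
  int old = __atomic_load_n (ptr, __ATOMIC_SEQ_CST);
  while (old < val
         && !__atomic_compare_exchange_n (ptr, &old, val, true,
                                          __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
    ;
  return old;
#endif
}
```

The "write it once" option would keep only the fallback branch and rely on
the compiler recognising the loop.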
Jakub
Hi Jakub,
We've been meaning to send this email for a while after the Cauldron.
IIUC you discussed this with Ramana there -- my understanding of what he
told me is that your main concern is with the explosion of builtins.
Similarly, I understand that reducing the number of builtins to just
`__atomic_fetch_min`, `__atomic_fetch_max`, `__atomic_min_fetch` and
`__atomic_max_fetch` (which are the same builtins currently exposed by
clang) would seem more acceptable.
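
As a reminder of the naming convention, and assuming the min/max builtins
follow the existing `__atomic_fetch_add` / `__atomic_add_fetch` pair, the
`fetch_OP` form returns the value before the operation and the `OP_fetch`
form returns the value after it. A small sketch (helper names are
hypothetical):

```c
/* Returns 1 if the existing add builtins behave as described.  */
static inline int
sketch_fetch_vs_op_fetch (void)
{
  int x = 10;
  int before = __atomic_fetch_add (&x, 5, __ATOMIC_SEQ_CST); /* old value */
  int after = __atomic_add_fetch (&x, 5, __ATOMIC_SEQ_CST);  /* new value */
  return before == 10 && after == 20 && x == 20;
}

/* Under that assumption, the result of a max_fetch can be computed from
   the old value that a fetch_max returned.  */
static inline int
sketch_max_fetch_result (int old, int val)
{
  return old < val ? val : old;
}
```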
I wanted to double-check that this understanding of a conversation I'm
hearing second-hand is correct.
Also I'd like to gather opinions (from you but also anyone who has an
opinion) about the following implementation design decisions:
1) Should we still implement the libatomic functions? And still with
the unsigned/signed distinction and all sizes?
- I'd expect so, mostly for the `-fno-inline-atomics` flag.
2) Earlier you floated the idea of using an internal function to encode
the operation through the rest of the compiler (outside of the
frontend). Does that approach still seem good to you?
3) For those architectures that don't have relevant operations, I guess
we expand the internal function to a CAS loop at the cfgexpand pass, in a
manner similar to what we do for builtins -- does that sound reasonable?
N.b. one of the negative points that Soumya pointed out about this
approach is that it requires a bit of custom handling in libatomic.
Points being:
1) We designed the ABI for libatomic functions to have signed/unsigned
and sized versions of the functions.
2) The libatomic signed/unsigned functions no longer have a
signed/unsigned builtin to directly call.
3) Hence we need to do some casting or macro magic in libatomic.
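
A rough sketch of the kind of casting/macro approach meant in point 3 is
below. All names here are hypothetical and do not reflect the actual
libatomic ABI or sources. SKETCH_FETCH_MAX stands in for a single
type-generic builtin (written as a CAS loop so the example compiles
without one), and the signedness is selected purely by the pointer type
that each sized entry point casts to.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for one type-generic builtin: the comparison is done in the
   type PTR points to, so signed vs. unsigned comes from the pointer.
   Uses GNU statement expressions and __typeof__.  */
#define SKETCH_FETCH_MAX(PTR, VAL)                                      \
  __extension__ ({                                                      \
    __typeof__ (*(PTR)) *p_ = (PTR);                                    \
    __typeof__ (*(PTR)) v_ = (VAL);                                     \
    __typeof__ (*(PTR)) old_ = __atomic_load_n (p_, __ATOMIC_SEQ_CST);  \
    while (old_ < v_                                                    \
           && !__atomic_compare_exchange_n (p_, &old_, v_, true,        \
                                            __ATOMIC_SEQ_CST,           \
                                            __ATOMIC_SEQ_CST))          \
      ;                                                                 \
    old_;                                                               \
  })

/* Define one sized entry point taking the storage type UTYPE and doing
   the comparison in CTYPE (the signed or unsigned view of that size).  */
#define SKETCH_DEFINE(NAME, UTYPE, CTYPE)                               \
  UTYPE NAME (UTYPE *ptr, UTYPE val)                                    \
  {                                                                     \
    return (UTYPE) SKETCH_FETCH_MAX ((CTYPE *) ptr, (CTYPE) val);       \
  }

SKETCH_DEFINE (sketch_fetch_umax_4, uint32_t, uint32_t)
SKETCH_DEFINE (sketch_fetch_smax_4, uint32_t, int32_t)
```

With the same bit pattern 0xffffffff in memory, the unsigned entry point
treats it as the maximum while the signed one treats it as -1.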
Overall, does the above sound reasonable?
Regards,
Matthew.