On Fri, Sep 05, 2025 at 10:30:49AM +0100, Matthew Malcomson wrote:
> Ok -- TBH I don't have any extra details on this argument right now and your
> point on its feasibility seems quite convincing. (Timezones are slowing
> communication with the other compiler team about why it's hard in their case --
> I may get more information later).
> 
> The other arguments towards a builtin I know of are less of a requirement
> and more about helping keep code standardised throughout the ecosystem:
> 
> - The design of the atomic builtins was to match the requirements for C++11.
> It would seem natural to me to keep matching the C++ standard as it evolves.
> (In this case also providing users writing C with a standard interface to
> use this functionality).
> 
> - Specifically for the fetch_min/fetch_max, the paper that proposed it be
> standardised discussed the forms of CAS loop that might be written and how
> the semantics of two of them are subtly different (section 5 of
> https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p0493r5.pdf). If we
> don't provide a user-interface then each user writing C would have to deal
> with the subtleties of atomic synchronisation vs aggressive optimisations
> themselves, increasing the possibility of mistakes being made.

Without a hw instruction (which is the case on most targets, I guess
except the LL/SC ones, which implement everything as a kind of CAS loop,
though not exactly), it will need to be implemented as a CAS loop anyway.

All builtins have a cost, and having to add min + max, times signed vs.
unsigned vs. maybe floating point, times the 5 different sizes is a lot of
builtins.  (For floating point the builtin needs to be specific about the
format: e.g. 2 bytes can be both _Float16 and bf16 and perhaps others,
16 bytes can be IEEE quad vs. IBM double double vs. Intel extended etc.)

The options are then:

- Write it just once in the headers using a CAS loop and have the compiler
pattern match it.  I really believe it is hard to obfuscate it too much
after optimizations; worst case there will be some casts.  I wouldn't
bother trying to optimize loops which do an atomic_load in each iteration
instead of using the new memory value from the CAS.

- Write it twice in the headers (one version if the builtin is supported,
one if it is not); this version is user unfriendly.

- Always use the builtin and let the compiler lower it to a CAS loop if
there is no optab; but that is backwards incompatible, as it doesn't
support older compilers.
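
To illustrate the write-it-once-in-the-headers variant, here is a rough
sketch of a fetch_max for int as a plain C11 CAS loop.  The name
my_atomic_fetch_max_int is made up for this example and is not an existing
GCC or libc interface.  This is the form that always performs the store
(even when the old value is already the maximum), which is only one of the
forms the paper discusses.

```c
#include <stdatomic.h>

/* Hypothetical header helper, for illustration only.  Returns the value
   *ptr held before the operation, like the other atomic_fetch_* functions.  */
static inline int
my_atomic_fetch_max_int (_Atomic int *ptr, int val, memory_order order)
{
  int old = atomic_load_explicit (ptr, memory_order_relaxed);

  /* On failure the CAS stores the current memory value into old, so the
     loop reuses it instead of doing another atomic_load per iteration.
     The desired value is recomputed from the updated old each time.  */
  while (!atomic_compare_exchange_weak_explicit (ptr, &old,
                                                 old > val ? old : val,
                                                 order,
                                                 memory_order_relaxed))
    ;

  return old;
}
```

A caller would use it like atomic_fetch_add_explicit, e.g.
my_atomic_fetch_max_int (&counter, 42, memory_order_seq_cst).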

        Jakub
