On Tue, Nov 04, 2025 at 11:33:05AM +0000, Matthew Malcomson wrote:
> We've been meaning to send this email for a while after the Cauldron.
> IIUC you discussed this with Ramana there -- my understanding of what he
> told me is that your main concern is with the explosion of builtins.

Yeah.

> Similarly, what I understand is that if we reduce the number of builtins to
> just the `__atomic_fetch_min`, `__atomic_fetch_max`, `__atomic_min_fetch`
> and `__atomic_max_fetch` (which are the same builtins currently exposed by
> clang) that seems more acceptable.
> 
> I wanted to double-check that this understanding of a conversation I'm
> hearing second-hand is correct.

Yes.  Even min vs. max could be reduced to minmax with an extra argument
which determines which one it is, but I'm fine with those 4.
But just those type-generic ones, not the _{1,2,4,8,16} suffixed variants
thereof etc.; let the compiler figure out the signedness and size (and will
it handle floating point too or not?).
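
To illustrate what "let the compiler figure out the signedness and size"
could mean, here is a rough sketch, not taken from any patch.  The macro
name ATOMIC_FETCH_MIN_SKETCH and the self-check function are made up for
illustration.  It only uses the existing __atomic_load_n and
__atomic_compare_exchange_n builtins plus the GNU __typeof__ and
statement-expression extensions, and it covers integer types only (the
floating point question stays open):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, NOT an existing GCC builtin: a type-generic
   fetch-min written as a CAS loop.  Size and signedness are taken from
   the pointee type via __typeof__, so no _{1,2,4,8,16} suffix is needed.
   Evaluates to the value stored at PTR before the operation.  */
#define ATOMIC_FETCH_MIN_SKETCH(ptr, val, memorder)                        \
  __extension__ ({                                                         \
    __typeof__ (*(ptr)) old_ = __atomic_load_n ((ptr), __ATOMIC_RELAXED);  \
    __typeof__ (*(ptr)) val_ = (val);                                      \
    __typeof__ (*(ptr)) new_;                                              \
    do                                                                     \
      /* Recomputed each round; a failed CAS refreshes old_.  */           \
      new_ = val_ < old_ ? val_ : old_;                                    \
    while (!__atomic_compare_exchange_n ((ptr), &old_, new_, 1,            \
                                         (memorder), __ATOMIC_RELAXED));   \
    old_;                                                                  \
  })

/* Same bit pattern, different signedness, different outcome.  */
void
fetch_min_sketch_selfcheck (void)
{
  int8_t s = -1;
  uint8_t u = 0xff;

  /* Signed: -1 < 5, so the stored value stays -1.  */
  assert (ATOMIC_FETCH_MIN_SKETCH (&s, 5, __ATOMIC_SEQ_CST) == -1);
  assert (s == -1);

  /* Unsigned: 255 > 5, so 5 is stored.  */
  assert (ATOMIC_FETCH_MIN_SKETCH (&u, 5, __ATOMIC_SEQ_CST) == 0xff);
  assert (u == 5);
}
```

The max variant would only flip the comparison, which is why a single
minmax entry point with a selector argument would also be possible.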

> Also I'd like to gather opinion (from you but also anyone who has an
> opinion) about the following implementation design decisions:
> 1) Should we still implement the libatomic functions?  And still with the
> unsigned/signed distinction and all sizes?
> - I'd expect so, mostly for the `-fno-inline-atomics` flag.

Do you really need that?  Can't you just emit a CAS loop for the non-inline
atomics?  Because the function explosion will be there again on the
libatomic side.  Or at least don't add such symbols on arches which are
never going to benefit from those right now (i.e. all the implementations
would be CAS loops anyway).  Some function explosion can be limited by only
having the __atomic_fetch_{min,max}_{s,u}{1,2,4,8,16} variants and not
the {min,max}_fetch ones, because that case can be implemented on the caller
side by just computing min/max of the returned old value and the DESIRED
argument.
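
A minimal sketch of that caller-side derivation, under the assumption of
a signed 4-byte fetch-min entry point.  Both function names below are
hypothetical stand-ins, not existing libatomic symbols, and the memory
order is hard-coded to seq_cst to keep the example short:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a signed 4-byte fetch-min entry point,
   implemented as a plain CAS loop.  Returns the old value.  */
int32_t
fetch_min_s4_sketch (int32_t *ptr, int32_t desired)
{
  int32_t old = __atomic_load_n (ptr, __ATOMIC_RELAXED);

  /* On failure the builtin stores the current value into OLD, so the
     minimum is recomputed on the next iteration.  */
  while (!__atomic_compare_exchange_n (ptr, &old,
                                       desired < old ? desired : old,
                                       1, __ATOMIC_SEQ_CST,
                                       __ATOMIC_RELAXED))
    ;
  return old;
}

/* The min_fetch flavour needs no separate library symbol: the caller
   derives the new value from the returned old value and DESIRED.  */
int32_t
min_fetch_s4_sketch (int32_t *ptr, int32_t desired)
{
  int32_t old = fetch_min_s4_sketch (ptr, desired);
  return desired < old ? desired : old;
}

void
min_fetch_sketch_selfcheck (void)
{
  int32_t v = 7;
  assert (min_fetch_s4_sketch (&v, 9) == 7);   /* 7 stays the minimum.  */
  assert (min_fetch_s4_sketch (&v, -2) == -2); /* -2 replaces 7.  */
  assert (v == -2);
}
```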

> 2) Earlier you floated the idea of using an internal function to encode the
> operation through the rest of the compiler (outside of the frontend).  Does
> that approach still seem good to you?

There are 2 options.
Lower the type-generic builtin into a CAS loop and pattern recognize it
late (e.g. in the widening_mul pass, certainly after IPA) into an IFN
if the corresponding optab is supported.
Or lower the type-generic builtin into an IFN (IFNs can have the min vs. max
argument and derive size and sign from the DESIRED argument), at some
perhaps early point before IPA (forwprop?) pattern match a CAS loop into
the IFN too, and then ideally shortly after IPA lower the IFN back into a CAS
loop if the optab doesn't exist.
The reason for the pre- vs. post-IPA distinction is OpenMP/OpenACC: before
IPA you don't always know what the backend will be.

> 3) For those architectures that don't have relevant operations, I guess we
> expand the internal function to a CAS loop at cfgexpand pass in a manner
> similar to how we do for builtins -- does that sound reasonable?

Preferably long before that, so that optimizations can optimize those if
possible.

> N.b. one of the negative points that Soumya pointed out about this approach
> is that it requires a bit of custom handling in libatomic.
> Points being:
> 1) We designed the ABI for libatomic functions to have signed/unsigned and
> sized versions of the functions.
> 2) The libatomic signed/unsigned functions no longer have a signed/unsigned
> builtin to directly call.
> 3) Hence we need to do some casting or macro magic in libatomic.

libatomic already has tons of macro magic.

        Jakub
