Re: [RFC] Builtins for C++26 integer atomic reductions

Matthew Malcomson via Gcc Thu, 28 May 2026 05:51:14 -0700

After offline discussion I thought it might be helpful to give a moredetailed explanation of the reason the memory-model semantics requirenew builtins for these reduction operations. Below I have an explicitwork-through of the first example given in the proposal for reductionoperations (the one Soumya referenced) -- http://wg21.link/p3111 .


The examples below are all in a "litmus" test format which is the
format for memory model simulating programs.  Initial values are at
the top, each function is performed by a different thread, at the
bottom we assert that it's not possible for thread 2 to have r1=2 and
thread 1 to have r0=0.  There are some C litmus tests and some ASM
litmus tests.  The ASM litmus tests can be ran & checked online at
https://diy.inria.fr/www/ without downloading anything, I believe the
C litmus tests require downloading a relevant tool -- for the
reduction operations you need https://github.com/hernanponcedeleon/Dat3M


Here we show that the behaviour of a relaxed atomic load & CAS blocks
reordering around the thread fence.  This happens as a fence-fence
synchronisation is formed between the fences in P1 and P0 if the read in
P1 reads the value written in P0.  N.b. Though not necessary I check
`worked` as well to make it clear this is the case when the CAS
succeeds and hence we don't have to ask the model to check a loop.
#+BEGIN_EXAMPLE
    C CAS
    { [x] = 0; [y] = 0; }
        P0(atomic_int* y,atomic_int* x) {
          atomic_store_explicit(x,1,memory_order_relaxed);
          atomic_thread_fence(memory_order_release);
          atomic_store_explicit(y,1,memory_order_relaxed);
        }

        P1(atomic_int* y,atomic_int* x, atomic_int *val) {
          int v = atomic_load_explicit(y, memory_order_relaxed);
          *val = v;

int worked = atomic_compare_exchange_strong_explicit(y, val,v+1, memory_order_relaxed, memory_order_relaxed);

          atomic_thread_fence(memory_order_acquire);
          int r0 = atomic_load_explicit(x,memory_order_relaxed);
        }

        P2(atomic_int* y) {
          int r1 = atomic_load_explicit(y,memory_order_relaxed);
        }
        ~exists (2:r1=2 /\ 1:r0=0 /\ 1:worked=1)
#+END_EXAMPLE

When running this under a memory model simulator we see that it
claims the property PASS'es -- i.e. that the above set of values is
not possible.
#+BEGIN_EXAMPLE
    vshcmd: > export DAT3M_HOME=$HOME/repos/memory-model-tools/Dat3M
    vshcmd: > export DAT3M_OUTPUT=${DAT3M_HOME}/output
    vshcmd: > echo && \
    vshcmd: > java -jar ${DAT3M_HOME}/dartagnan/target/dartagnan.jar \
    vshcmd: >      ${DAT3M_HOME}/cat/rc11.cat \
    vshcmd: >      --property=program_spec \
    vshcmd: >      --target=c11 \
    vshcmd: >      cas-example.litmus
    Dat3M [14:42:27] $ Dat3M [14:42:27] $ > > > > >
    Test: cas-example.litmus
    Result: PASS
    Time: 0.299 secs
    Dat3M [14:42:28] $
#+END_EXAMPLE

The current codegen for this (massaged a little to remove branches)
gives us the below litmus test.  Similar to the C litmus tests this
has initialisation at the top, but here what each thread does is
written in a different column.
#+BEGIN_EXAMPLE
    AArch64 Instruction cas
    {
       0:X0=y; 0:X1=x; 0:X2=0; 0:X3=0; 0:X4=0;
       1:X0=y; 1:X1=x; 1:X2=0; 1:X3=0; 1:X4=0;
       2:X0=y; 2:X1=x; 2:X2=0; 2:X3=0; 2:X4=0;
    }
    P0               |P1                   |P2                ;
    MOV     X2, #1   |LDR     X2, [X0]     |LDR     X3, [X0]  ;
    STR     X2, [X1] |ADD     X4, X2, #1   |                  ;
    DMB     ISHLD    |MOV     X3, X2       |                  ;
    DMB     ISHST    |CAS     X3, X4, [X0] |                  ;
    STR     X2, [X0] |DMB     ISHLD        |                  ;
                     |LDR     X4, [X1]     |                  ;
    ~exists (2:X3=2 /\ 1:X4=0 /\ 1:X2=1:X3)
#+END_EXAMPLE

As should be the case, this satisfies the limitation that the source
code requires.
#+BEGIN_EXAMPLE
    vshcmd: > export DAT3M_HOME=$HOME/repos/memory-model-tools/Dat3M
    vshcmd: > export DAT3M_OUTPUT=${DAT3M_HOME}/output
    vshcmd: > echo && \
    vshcmd: > java -jar ${DAT3M_HOME}/dartagnan/target/dartagnan.jar \
    vshcmd: >      ${DAT3M_HOME}/cat/aarch64.cat \
    vshcmd: >      --property=program_spec \
    vshcmd: >      --target=arm8 \
    vshcmd: >      cas-asm-example.litmus
    Dat3M [14:47:15] $ Dat3M [14:47:15] $ > > > > >
    Test: cas-asm-example.litmus
    Result: PASS
    Time: 0.254 secs
    Dat3M [14:47:16] $
#+END_EXAMPLE

This could be pattern-matched to an LDADD (equivalent of fetch_op):
#+BEGIN_EXAMPLE
    AArch64 Instruction fetch_op
    {
       0:X0=y; 0:X1=x; 0:X2=0; 0:X3=0; 0:X4=0;
       1:X0=y; 1:X1=x; 1:X2=0; 1:X3=0; 1:X4=0;
       2:X0=y; 2:X1=x; 2:X2=0; 2:X3=0; 2:X4=0;
    }
    P0               |P1                   |P2                ;
    MOV     X2, #1   |MOV     X2, #1       |LDR     X3, [X0]  ;
    STR     X2, [X1] |LDADD   X2, X2, [X0] |                  ;
    DMB     ISHLD    |DMB     ISHLD        |                  ;
    DMB     ISHST    |LDR     X4, [X1]     |                  ;
    STR     X2, [X0] |                     |                  ;
    ~exists (2:X3=2 /\ 1:X4=0)
#+END_EXAMPLE
#+BEGIN_EXAMPLE
    vshcmd: > export DAT3M_HOME=$HOME/repos/memory-model-tools/Dat3M
    vshcmd: > export DAT3M_OUTPUT=${DAT3M_HOME}/output
    vshcmd: > echo && \
    vshcmd: > java -jar ${DAT3M_HOME}/dartagnan/target/dartagnan.jar \
    vshcmd: >      ${DAT3M_HOME}/cat/aarch64.cat \
    vshcmd: >      --property=program_spec \
    vshcmd: >      --target=arm8 \
    vshcmd: >      fetch-asm-example.litmus
    Dat3M [14:49:49] $ Dat3M [14:49:49] $ > > > > >
    Test: fetch-asm-example.litmus
    Result: PASS
    Time: 0.259 secs
    Dat3M [14:49:50] $
#+END_EXAMPLE

But the instruction sequence we're looking to enable is STADD instead
of LDADD.  This does not trigger the same fence-fence
synchronisation:
#+BEGIN_EXAMPLE
    AArch64 Instruction reduction
    {
       0:X0=y; 0:X1=x; 0:X2=0; 0:X3=0; 0:X4=0;
       1:X0=y; 1:X1=x; 1:X2=0; 1:X3=0; 1:X4=0;
       2:X0=y; 2:X1=x; 2:X2=0; 2:X3=0; 2:X4=0;
    }
    P0               |P1               |P2                ;
    MOV     X2, #1   |MOV     X2, #1   |LDR     X3, [X0]  ;
    STR     X2, [X1] |STADD   X2, [X0] |                  ;
    DMB     ISHLD    |DMB     ISHLD    |                  ;
    DMB     ISHST    |LDR     X4, [X1] |                  ;
    STR     X2, [X0] |                 |                  ;
    ~exists (2:X3=2 /\ 1:X4=0)
#+END_EXAMPLE
#+BEGIN_EXAMPLE
    vshcmd: > export DAT3M_HOME=$HOME/repos/memory-model-tools/Dat3M
    vshcmd: > export DAT3M_OUTPUT=${DAT3M_HOME}/output
    vshcmd: > echo && \
    vshcmd: > java -jar ${DAT3M_HOME}/dartagnan/target/dartagnan.jar \
    vshcmd: >      ${DAT3M_HOME}/cat/aarch64.cat \
    vshcmd: >      --property=program_spec \
    vshcmd: >      --target=arm8 \
    vshcmd: >      reduction-asm-example.litmus
    Dat3M [15:38:02] $ Dat3M [15:38:02] $ > > > > >
    Test: reduction-asm-example.litmus
    Result: FAIL
    Reason: Program specification violation found
    Condition: not_exists (2:X3 == bv64(2)) && (1:X4 == bv64(0))
    Time: 0.271 secs
    Dat3M [15:38:03] $
#+END_EXAMPLE

The new C++26 reductions have the same relaxation as this STADD
instruction.
#+BEGIN_EXAMPLE
    C Reduction behaviour
    { [x] = 0; [y] = 0; }
        P0(atomic_int* y,atomic_int* x) {
          atomic_store_explicit(x,1,memory_order_relaxed);
          atomic_thread_fence(memory_order_release);
          atomic_store_explicit(y,1,memory_order_relaxed);
        }

        P1(atomic_int* y,atomic_int* x) {
          atomic_store_add_explicit(y,1,memory_order_relaxed);
          atomic_thread_fence(memory_order_acquire);
          int r0 = atomic_load_explicit(x,memory_order_relaxed);
        }

        P2(atomic_int* y) {
          int r1 = atomic_load_explicit(y,memory_order_relaxed);
        }
        ~exists (2:r1=2 /\ 1:r0=0)
#+END_EXAMPLE
#+BEGIN_EXAMPLE
    vshcmd: > export DAT3M_HOME=$HOME/repos/memory-model-tools/Dat3M
    vshcmd: > export DAT3M_OUTPUT=${DAT3M_HOME}/output
    vshcmd: > echo && \
    vshcmd: > java -jar ${DAT3M_HOME}/dartagnan/target/dartagnan.jar \
    vshcmd: >      ${DAT3M_HOME}/cat/rc11.cat \
    vshcmd: >      --property=program_spec \
    vshcmd: >      --target=c11 \
    vshcmd: >      reduction-example.litmus
    Dat3M [14:41:13] $ Dat3M [14:41:13] $ > > > > >
    Test: reduction-example.litmus
    Result: FAIL
    Reason: Program specification violation found
    Condition: not_exists (2:r1 == bv64(2)) && (1:r0 == bv64(0))
    Time: 0.253 secs
    Dat3M [14:41:13] $
#+END_EXAMPLE

This means that the CAS loop source-level construct specifies
restrictions that block the use of STADD (and similar for other "not
a read" instructions implementing different operations), but the new
reductions do not.  This is why we believe builtins are necessary to
emit the most optimal code for these reductions.

It is possible to implement the reductions correctly without newbuiltins -- just not optimally.


On 4/14/26 11:21, Soumya AR wrote:

Ping. (Also summarizing some relevant discussions on IRC)

We don't think these can be implemented using pattern-matching, as fetch_ops or
CAS loops produce visible reads, while reductions do not.

Therefore, pattern-matching an unused fetch_op result into a reduction would
remove the visible read, which could violate memory orderings.

Even if the fetch_op is in RELAXED memory ordering, fences can still synchronize
on that visible read, as is seen in the Litmus test from Will Deacon in
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3111r8.html.

Given these constraints, and knowing the best possible approach is ideally
one that uses the least builtins, is it OK to implement this using Option 1?

Thanks,
Soumya

On 24 Feb 2026, at 11:57 AM, Soumya AR <[email protected]> wrote:

Hi,

I'm working on implementing support for C++26 atomic reductions in GCC.

Corresponding paper for this:
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3111r8.html

I'd like to get some insight into the preferable builtin interface. As I see it,
we have the following options:

Option 1 (that we expect to implement): One generic builtin per new operation

We add the generic builtin for each new operation and resolve it in the
middle-end later. This is similar to the atomic int fetch_min/max implementation
in this patch:

https://gcc.gnu.org/pipermail/gcc-patches/2026-January/706242.html

This would result in 7 new builtins, 1 for each new operation.

Option 2: Explicit builtins per operation and their types:

This would follow the existing pattern and add a new family of builtins.
Assuming we name them __atomic_store_<op>, we would have:
__atomic_store_add,
__atomic_store_sub,
__atomic_store_and,
__atomic_store_or,
__atomic_store_xor,
__atomic_store_max,
__atomic_store_min,
and all of their typed variants (_N, _1, _2, _4, _8, _16), resulting in 42 new
builtins.

However, this would again result in the earlier issue of a builtin explosion
wrt integer fetch min/max discussed here:
https://gcc.gnu.org/pipermail/gcc-patches/2025-November/699602.html

Option 3: Extend existing __atomic_fetch_* builtins

Add a parameter to the existing __atomic_fetch_* builtins to choose their
no-read / reduction variant.

Although this leads to no new builtins added, the name fetch_* would become a
misnomer for these and lead to a confusing user interface.

Would appreciate some feedback on what would be the best step forward here for
GCC. Will accordingly get feedback from the LLVM community on this as well.

Best,
Soumya

Re: [RFC] Builtins for C++26 integer atomic reductions

Reply via email to