After offline discussion I thought it might be helpful to give a more
detailed explanation of the reason the memory-model semantics require
new builtins for these reduction operations. Below I have an explicit
work-through of the first example given in the proposal for reduction
operations (the one Soumya referenced) -- http://wg21.link/p3111 .
The examples below are all in a "litmus" test format which is the
format for memory model simulating programs. Initial values are at
the top, each function is performed by a different thread, at the
bottom we assert that it's not possible for thread 2 to have r1=2 and
thread 1 to have r0=0. There are some C litmus tests and some ASM
litmus tests. The ASM litmus tests can be ran & checked online at
https://diy.inria.fr/www/ without downloading anything, I believe the
C litmus tests require downloading a relevant tool -- for the
reduction operations you need https://github.com/hernanponcedeleon/Dat3M
Here we show that the behaviour of a relaxed atomic load & CAS blocks
reordering around the thread fence. This happens as a fence-fence
synchronisation is formed between the fences in P1 and P0 if the read in
P1 reads the value written in P0. N.b. Though not necessary I check
`worked` as well to make it clear this is the case when the CAS
succeeds and hence we don't have to ask the model to check a loop.
#+BEGIN_EXAMPLE
C CAS
{ [x] = 0; [y] = 0; }
P0(atomic_int* y,atomic_int* x) {
atomic_store_explicit(x,1,memory_order_relaxed);
atomic_thread_fence(memory_order_release);
atomic_store_explicit(y,1,memory_order_relaxed);
}
P1(atomic_int* y,atomic_int* x, atomic_int *val) {
int v = atomic_load_explicit(y, memory_order_relaxed);
*val = v;
int worked = atomic_compare_exchange_strong_explicit(y, val,
v+1, memory_order_relaxed, memory_order_relaxed);
atomic_thread_fence(memory_order_acquire);
int r0 = atomic_load_explicit(x,memory_order_relaxed);
}
P2(atomic_int* y) {
int r1 = atomic_load_explicit(y,memory_order_relaxed);
}
~exists (2:r1=2 /\ 1:r0=0 /\ 1:worked=1)
#+END_EXAMPLE
When running this under a memory model simulator we see that it
claims the property PASS'es -- i.e. that the above set of values is
not possible.
#+BEGIN_EXAMPLE
vshcmd: > export DAT3M_HOME=$HOME/repos/memory-model-tools/Dat3M
vshcmd: > export DAT3M_OUTPUT=${DAT3M_HOME}/output
vshcmd: > echo && \
vshcmd: > java -jar ${DAT3M_HOME}/dartagnan/target/dartagnan.jar \
vshcmd: > ${DAT3M_HOME}/cat/rc11.cat \
vshcmd: > --property=program_spec \
vshcmd: > --target=c11 \
vshcmd: > cas-example.litmus
Dat3M [14:42:27] $ Dat3M [14:42:27] $ > > > > >
Test: cas-example.litmus
Result: PASS
Time: 0.299 secs
Dat3M [14:42:28] $
#+END_EXAMPLE
The current codegen for this (massaged a little to remove branches)
gives us the below litmus test. Similar to the C litmus tests this
has initialisation at the top, but here what each thread does is
written in a different column.
#+BEGIN_EXAMPLE
AArch64 Instruction cas
{
0:X0=y; 0:X1=x; 0:X2=0; 0:X3=0; 0:X4=0;
1:X0=y; 1:X1=x; 1:X2=0; 1:X3=0; 1:X4=0;
2:X0=y; 2:X1=x; 2:X2=0; 2:X3=0; 2:X4=0;
}
P0 |P1 |P2 ;
MOV X2, #1 |LDR X2, [X0] |LDR X3, [X0] ;
STR X2, [X1] |ADD X4, X2, #1 | ;
DMB ISHLD |MOV X3, X2 | ;
DMB ISHST |CAS X3, X4, [X0] | ;
STR X2, [X0] |DMB ISHLD | ;
|LDR X4, [X1] | ;
~exists (2:X3=2 /\ 1:X4=0 /\ 1:X2=1:X3)
#+END_EXAMPLE
As should be the case, this satisfies the limitation that the source
code requires.
#+BEGIN_EXAMPLE
vshcmd: > export DAT3M_HOME=$HOME/repos/memory-model-tools/Dat3M
vshcmd: > export DAT3M_OUTPUT=${DAT3M_HOME}/output
vshcmd: > echo && \
vshcmd: > java -jar ${DAT3M_HOME}/dartagnan/target/dartagnan.jar \
vshcmd: > ${DAT3M_HOME}/cat/aarch64.cat \
vshcmd: > --property=program_spec \
vshcmd: > --target=arm8 \
vshcmd: > cas-asm-example.litmus
Dat3M [14:47:15] $ Dat3M [14:47:15] $ > > > > >
Test: cas-asm-example.litmus
Result: PASS
Time: 0.254 secs
Dat3M [14:47:16] $
#+END_EXAMPLE
This could be pattern-matched to an LDADD (equivalent of fetch_op):
#+BEGIN_EXAMPLE
AArch64 Instruction fetch_op
{
0:X0=y; 0:X1=x; 0:X2=0; 0:X3=0; 0:X4=0;
1:X0=y; 1:X1=x; 1:X2=0; 1:X3=0; 1:X4=0;
2:X0=y; 2:X1=x; 2:X2=0; 2:X3=0; 2:X4=0;
}
P0 |P1 |P2 ;
MOV X2, #1 |MOV X2, #1 |LDR X3, [X0] ;
STR X2, [X1] |LDADD X2, X2, [X0] | ;
DMB ISHLD |DMB ISHLD | ;
DMB ISHST |LDR X4, [X1] | ;
STR X2, [X0] | | ;
~exists (2:X3=2 /\ 1:X4=0)
#+END_EXAMPLE
#+BEGIN_EXAMPLE
vshcmd: > export DAT3M_HOME=$HOME/repos/memory-model-tools/Dat3M
vshcmd: > export DAT3M_OUTPUT=${DAT3M_HOME}/output
vshcmd: > echo && \
vshcmd: > java -jar ${DAT3M_HOME}/dartagnan/target/dartagnan.jar \
vshcmd: > ${DAT3M_HOME}/cat/aarch64.cat \
vshcmd: > --property=program_spec \
vshcmd: > --target=arm8 \
vshcmd: > fetch-asm-example.litmus
Dat3M [14:49:49] $ Dat3M [14:49:49] $ > > > > >
Test: fetch-asm-example.litmus
Result: PASS
Time: 0.259 secs
Dat3M [14:49:50] $
#+END_EXAMPLE
But the instruction sequence we're looking to enable is STADD instead
of LDADD. This does not trigger the same fence-fence
synchronisation:
#+BEGIN_EXAMPLE
AArch64 Instruction reduction
{
0:X0=y; 0:X1=x; 0:X2=0; 0:X3=0; 0:X4=0;
1:X0=y; 1:X1=x; 1:X2=0; 1:X3=0; 1:X4=0;
2:X0=y; 2:X1=x; 2:X2=0; 2:X3=0; 2:X4=0;
}
P0 |P1 |P2 ;
MOV X2, #1 |MOV X2, #1 |LDR X3, [X0] ;
STR X2, [X1] |STADD X2, [X0] | ;
DMB ISHLD |DMB ISHLD | ;
DMB ISHST |LDR X4, [X1] | ;
STR X2, [X0] | | ;
~exists (2:X3=2 /\ 1:X4=0)
#+END_EXAMPLE
#+BEGIN_EXAMPLE
vshcmd: > export DAT3M_HOME=$HOME/repos/memory-model-tools/Dat3M
vshcmd: > export DAT3M_OUTPUT=${DAT3M_HOME}/output
vshcmd: > echo && \
vshcmd: > java -jar ${DAT3M_HOME}/dartagnan/target/dartagnan.jar \
vshcmd: > ${DAT3M_HOME}/cat/aarch64.cat \
vshcmd: > --property=program_spec \
vshcmd: > --target=arm8 \
vshcmd: > reduction-asm-example.litmus
Dat3M [15:38:02] $ Dat3M [15:38:02] $ > > > > >
Test: reduction-asm-example.litmus
Result: FAIL
Reason: Program specification violation found
Condition: not_exists (2:X3 == bv64(2)) && (1:X4 == bv64(0))
Time: 0.271 secs
Dat3M [15:38:03] $
#+END_EXAMPLE
The new C++26 reductions have the same relaxation as this STADD
instruction.
#+BEGIN_EXAMPLE
C Reduction behaviour
{ [x] = 0; [y] = 0; }
P0(atomic_int* y,atomic_int* x) {
atomic_store_explicit(x,1,memory_order_relaxed);
atomic_thread_fence(memory_order_release);
atomic_store_explicit(y,1,memory_order_relaxed);
}
P1(atomic_int* y,atomic_int* x) {
atomic_store_add_explicit(y,1,memory_order_relaxed);
atomic_thread_fence(memory_order_acquire);
int r0 = atomic_load_explicit(x,memory_order_relaxed);
}
P2(atomic_int* y) {
int r1 = atomic_load_explicit(y,memory_order_relaxed);
}
~exists (2:r1=2 /\ 1:r0=0)
#+END_EXAMPLE
#+BEGIN_EXAMPLE
vshcmd: > export DAT3M_HOME=$HOME/repos/memory-model-tools/Dat3M
vshcmd: > export DAT3M_OUTPUT=${DAT3M_HOME}/output
vshcmd: > echo && \
vshcmd: > java -jar ${DAT3M_HOME}/dartagnan/target/dartagnan.jar \
vshcmd: > ${DAT3M_HOME}/cat/rc11.cat \
vshcmd: > --property=program_spec \
vshcmd: > --target=c11 \
vshcmd: > reduction-example.litmus
Dat3M [14:41:13] $ Dat3M [14:41:13] $ > > > > >
Test: reduction-example.litmus
Result: FAIL
Reason: Program specification violation found
Condition: not_exists (2:r1 == bv64(2)) && (1:r0 == bv64(0))
Time: 0.253 secs
Dat3M [14:41:13] $
#+END_EXAMPLE
This means that the CAS loop source-level construct specifies
restrictions that block the use of STADD (and similar for other "not
a read" instructions implementing different operations), but the new
reductions do not. This is why we believe builtins are necessary to
emit the most optimal code for these reductions.
It is possible to implement the reductions correctly without new
builtins -- just not optimally.
On 4/14/26 11:21, Soumya AR wrote:
Ping. (Also summarizing some relevant discussions on IRC)
We don't think these can be implemented using pattern-matching, as fetch_ops or
CAS loops produce visible reads, while reductions do not.
Therefore, pattern-matching an unused fetch_op result into a reduction would
remove the visible read, which could violate memory orderings.
Even if the fetch_op is in RELAXED memory ordering, fences can still synchronize
on that visible read, as is seen in the Litmus test from Will Deacon in
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3111r8.html.
Given these constraints, and knowing the best possible approach is ideally
one that uses the least builtins, is it OK to implement this using Option 1?
Thanks,
Soumya
On 24 Feb 2026, at 11:57 AM, Soumya AR <[email protected]> wrote:
Hi,
I'm working on implementing support for C++26 atomic reductions in GCC.
Corresponding paper for this:
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2025/p3111r8.html
I'd like to get some insight into the preferable builtin interface. As I see it,
we have the following options:
Option 1 (that we expect to implement): One generic builtin per new operation
We add the generic builtin for each new operation and resolve it in the
middle-end later. This is similar to the atomic int fetch_min/max implementation
in this patch:
https://gcc.gnu.org/pipermail/gcc-patches/2026-January/706242.html
This would result in 7 new builtins, 1 for each new operation.
Option 2: Explicit builtins per operation and their types:
This would follow the existing pattern and add a new family of builtins.
Assuming we name them __atomic_store_<op>, we would have:
__atomic_store_add,
__atomic_store_sub,
__atomic_store_and,
__atomic_store_or,
__atomic_store_xor,
__atomic_store_max,
__atomic_store_min,
and all of their typed variants (_N, _1, _2, _4, _8, _16), resulting in 42 new
builtins.
However, this would again result in the earlier issue of a builtin explosion
wrt integer fetch min/max discussed here:
https://gcc.gnu.org/pipermail/gcc-patches/2025-November/699602.html
Option 3: Extend existing __atomic_fetch_* builtins
Add a parameter to the existing __atomic_fetch_* builtins to choose their
no-read / reduction variant.
Although this leads to no new builtins added, the name fetch_* would become a
misnomer for these and lead to a confusing user interface.
Would appreciate some feedback on what would be the best step forward here for
GCC. Will accordingly get feedback from the LLVM community on this as well.
Best,
Soumya