https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697
--- Comment #23 from James Greenhalgh <jgreenhalgh at gcc dot gnu.org> ---
(In reply to torvald from comment #22)
> (In reply to James Greenhalgh from comment #12)
> > There are two problems here, one of which concerns me more in the real
> > world, and both of which rely on races if you are in the C/C++11 model -
> > there isn't much I can do about that as the __sync primitives are legacy and
> > give stronger guarantees about "visible memory references" (which includes
> > ALL global data).
> >
> > Consider:
> >
> > int x = 0;
> > int y = 0;
> > int z = 0;
> >
> > void foo (void)
> > {
> > x = 1;
> > __sync_fetch_and_add (&y, 1);
> > z = 1;
> > }
> >
> > My reading of what a full barrier implies would suggest that we should never
> > have a modification order which is any of:
> >
> > z = 1, x = 1, y = 0+1
> > z = 1, y = 0+1, x = 1
> > x = 1, z = 1, y = 0+1
>
> At least in C11/C++11, modification orders are per memory location. If we
> want to have something that orders accesses to x, y, and z, we need to model
> the observer side too (which the C11 model does, for example). I'm
> mentioning that because at the very least, we have compiler reordering
> happening at the observer side; if a programmer would need to use, for
> example, __sync builtins to order the observers, then this means we "just"
> have to consider combinations of those on the modification and the
> observation side.
>
> (And because I'm not sure what you think "modification order" is, I can't
> comment further.)
Yes. I'm sorry, I was imprecise in my language as I couldn't quite find the
right phrase in the specifications. What I am trying to get at is the order in
which observers in the system are able to observe the writes to x, y and z.
Later in this post I'll call this "observation order", until I can find better
terminology.
I believe the __sync builtins should guarantee that an observer which has seen
the write to z could trust:
assert (z == 1 && y == 1 && x == 1);
to hold. That is, the write to z should only be observed once the writes to y
and x have also been observed.
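To make that expectation concrete, here is a minimal sketch of the observer
side (the function name is mine, and I'm assuming the observer orders its own
loads with a __sync builtin, along the lines suggested above):

  #include <assert.h>

  extern int x, y, z;   /* the globals from the example above */

  void observer (void)
  {
    if (z == 1)
      {
        /* Order the load of z before the loads of x and y.  */
        __sync_synchronize ();
        /* The guarantee I'd expect from the __sync builtins: once the
           write to z is visible, the writes to x and y are too.  */
        assert (x == 1 && y == 1);
      }
  }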
> > GCC 5.0-ish (recent trunk) will emit:
> >
> > foo:
> > adrp x3, .LANCHOR0
> > mov w1, 1
> > add x0, x3, :lo12:.LANCHOR0
> > add x2, x0, 4
> > str w1, [x3, #:lo12:.LANCHOR0]
> > .L2:
> > ldaxr w3, [x2]
> > add w3, w3, w1
> > stlxr w4, w3, [x2]
> > cbnz w4, .L2
> > str w1, [x0, 8]
> > ret
> >
> > Dropping some of the context and switching to pseudo-asm we have:
> >
> > str 1, [x]
> > .L2:
> > ldaxr tmp, [y]
> > add tmp, tmp, 1
> > stlxr flag, tmp, [y]
> > cbnz flag, .L2
> > str 1, [z]
> >
> > As far as I understand it, the memory ordering guarantees of the
> > half-barrier LDAXR/STLXR instructions do not prevent this being reordered to
> > an execution which looks like:
> >
> > ldaxr tmp, [y]
> > str 1, [z]
> > str 1, [x]
> > add tmp, tmp, 1
> > stlxr flag, tmp, [y]
> > cbnz flag, .L2
> >
> > which gives one of the modification orders we wanted to forbid above.
> > Similar reordering can give the other undesirable modification orders.
>
> (I'm not familiar with the ARM model, so please bear with me.)
>
> This is reordering the HW can do? Or are you concerned about the compiler
> backend?
This comes from the architectural description of the memory model - it is a
reordering the HW is permitted to do, rather than something the compiler back
end introduces.
> Would the reordered str always become visible atomically with the stlxr?
> Would it become visible even if the LLSC fails, thus potentially storing
> more than once to z? This would be surprising in that the ldaxr would then
> have to be reloaded too potentially after the store to z, I believe (at
> least for a strong CAS) -- which would break the acquire MO on the load.
The STR would not become visible until after the stlxr had resolved as
successful; however, it can take a position in the observation order ahead of
the stlxr, such that the ordering of writes seen by observers would not be
program order.
> > As mentioned, this reordering is permitted under C11, as the stores to x and
> > z are racy - this permits the CPPMEM/Cambridge documentation of what an
> > SEQ_CST must do.
>
> That's true in this example, but at the instruction level (assuming HW
> reordering is what you're concerned about), a atomic relaxed-MO load isn't
> distinguishable from a normal memory access, right? So, it's not DRF what
> this is strictly about, but the difference between C11 seq_cst fences and
> seq_cst RMW ops.
Absolutely! I was trying to preempt "That program is racy" responses :).
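To illustrate the distinction in C11 terms (illustrative only; the function
names are mine):

  #include <stdatomic.h>

  int x, z;
  atomic_int y;

  /* seq_cst RMW: the RMW itself is ordered, but nothing in the C11
     model orders the racy plain stores to x and z around it.  */
  void foo_rmw (void)
  {
    x = 1;
    atomic_fetch_add_explicit (&y, 1, memory_order_seq_cst);
    z = 1;
  }

  /* seq_cst fences around a relaxed RMW: on AArch64 these become full
     barriers (DMB), which is much closer to what the __sync
     documentation promises for "visible memory references".  */
  void foo_fences (void)
  {
    x = 1;
    atomic_thread_fence (memory_order_seq_cst);
    atomic_fetch_add_explicit (&y, 1, memory_order_relaxed);
    atomic_thread_fence (memory_order_seq_cst);
    z = 1;
  }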
> > However, my feeling is that it is at the very least
> > *surprising* for the __sync primitives to allow this ordering, and more
> > likely points to a break in the AArch64 implementation.
>
> With the caveat that given that __sync isn't documented in great detail, a
> lot of interpretations might happen in practice, so there might be a few
> surprises to some people :)
:) Agreed.
> > Fixing this requires a hard memory barrier - but I really don't want to see
> > us penalise C11 SEQ_CST to get the right behaviour out of
> > __sync_fetch_and_add...
>
> I agree.
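To spell out that concern in code (illustrative only; the function names are
mine, and I'm assuming both builtins currently go through the same expansion
on AArch64):

  int y;

  void sync_add (void)
  {
    /* Legacy builtin; documented as a full barrier on all
       "visible memory references".  */
    __sync_fetch_and_add (&y, 1);
  }

  void atomic_add (void)
  {
    /* C11-style seq_cst RMW; should not grow an extra barrier as a
       side effect of fixing the __sync case.  */
    __atomic_fetch_add (&y, 1, __ATOMIC_SEQ_CST);
  }

Any fix that inserts a hard barrier for the first form needs to leave the code
generated for the second form alone.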