https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67461
Bug ID: 67461
Summary: Multiple atomic stores generate a StoreLoad barrier
between each one, not just at the end
Product: gcc
Version: 5.2.0
Status: UNCONFIRMED
Severity: minor
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Multiple atomic stores in a row generate multiple barriers.
I noticed this while playing around with the same code that led to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67458. This is a separate issue,
but if I've left out any context here, see that bug for the details.
I suspect this is a case of correctness trumping performance, since atomics are
still new. These cases are probably just missing optimizations for unusual
use-cases, and fixing them is unlikely to benefit real code (unless it
over-uses atomics).
#include <atomic>
std::atomic<int> a, c;
void simple_set(void){ a=1; a=1; a=3; c=2; a=3; }
compiles to (x86, g++ 5.2.0 -O3, on godbolt.org)
movl $1, %eax
movl %eax, a(%rip) # a=1
mfence
movl %eax, a(%rip) # a=1
movl $3, %eax
mfence
movl %eax, a(%rip) # a=3
mfence
movl $2, c(%rip) # c=2
mfence
movl %eax, a(%rip) # a=3
mfence
ret
First, does the C++ standard actually require emitting multiple stores to the
same variable in a row, if there are no intervening loads or stores in the
source? I would have thought that at least a=1; a=1; would collapse to a single
store.
Consider
initially: a=0
thread1: a=1; a=1;
thread2: tmp=a.exchange(2); tmp2=a;
These operations have to happen in some order, but isn't the compiler allowed
to make decisions at compile-time that eliminate some possible orders? e.g.
collapsing both a=1 operations into a single store would make this impossible:
a=1; tmp=a.exchange(2); a=1; tmp2=a;
But the remaining two orderings are valid, and I think it would be an error for
software to depend on that interleaved ordering being possible. Does the
standard require generation of machine code that can end up with tmp=1, tmp2=1?
If it does, then this isn't a bug. >.<
More generally, collapsing a=1; a=3; into a single store should be ok for the
same reason.
A producer thread doing stores separated by StoreStore barriers to feed a
consumer thread doing loads separated by LoadLoad barriers gives no guarantee
that the consumer doesn't miss some events.
-----------
There are no loads between the stores, so I don't understand having multiple
StoreLoad barriers (mfence), unless that's just a missing optimization, too.
Are the mfence instructions between each store supposed to protect a signal
handler from something? An interrupt could still come in after the first store
but before the first mfence. (clang uses a (lock) xchg for each
sequentially-consistent atomic store, which, being a single instruction, rules
out an interrupt landing between the store and the barrier.)
I guess if the signal handler sees a=3, it knows that the mfence between a=1
and a=3 has already happened, but not necessarily the mfence after a=3.
If these extra mfences in a sequence of stores are just for the potential
benefit of a signal handler, doesn't that serialization already happen as part
of the context switch to/from the kernel? It seems very inefficient that stores
to multiple atomic variables produce multiple mfences.
It's worse for ARM, where there's a full memory barrier before and after every
atomic store, so two stores in a row produce two memory barriers in a row.
dmb sy
movs r2, #1
str r2, [r3] # a=1
dmb sy
dmb sy
str r2, [r3] # a=1
dmb sy