https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67461
Bug ID:           67461
Summary:          Multiple atomic stores generate a StoreLoad barrier
                  between each one, not just at the end
Product:          gcc
Version:          5.2.0
Status:           UNCONFIRMED
Severity:         minor
Priority:         P3
Component:        c++
Assignee:         unassigned at gcc dot gnu.org
Reporter:         peter at cordes dot ca
Target Milestone: ---

Multiple atomic stores in a row generate multiple barriers. I noticed this
while playing around with the same code that led to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67458. This is a separate issue,
but if I left out any context, look at that bug.

I suspect this is a case of correctness trumping performance, since atomics
are still new. These cases are probably just missing optimizations for
unusual use-cases, and fixing them is unlikely to benefit real code (unless
it over-uses atomics).

#include <atomic>
std::atomic<int> a, c;

void simple_set(void) {
    a=1;
    a=1;
    a=3;
    c=2;
    a=3;
}

compiles to (x86, g++ 5.2.0 -O3, on godbolt.org):

    movl    $1, %eax
    movl    %eax, a(%rip)   # a=1
    mfence
    movl    %eax, a(%rip)   # a=1
    movl    $3, %eax
    mfence
    movl    %eax, a(%rip)   # a=3
    mfence
    movl    $2, c(%rip)     # c=2
    mfence
    movl    %eax, a(%rip)   # a=3
    mfence
    ret

First, does the C++ standard actually require the compiler to emit multiple
stores for repeated assignments to the same variable, if there are no
intervening loads or stores in the source? I would have thought that at
least a=1; a=1; would collapse to a single store. Consider:

    initially: a=0
    thread1:   a=1; a=1;
    thread2:   tmp=a.exchange(2); tmp2=a;

These operations have to happen in some order, but isn't the compiler
allowed to make decisions at compile time that eliminate some possible
orders? e.g. collapsing both a=1 operations into a single store would make
this interleaving impossible:

    a=1; tmp=a.exchange(2); a=1; tmp2=a;

But the remaining orderings are valid, and I think it would be an error for
software to depend on that interleaved ordering being possible. Does the
standard require generating machine code that can end up with tmp=1, tmp2=1?
If it does, then this isn't a bug.
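The two-thread scenario above can be written out as a runnable sketch. The helper name run_once and the Result struct are mine, not from the report; the point is that the observable results are the same whether or not the compiler collapses the two a=1 stores:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> a{0};

struct Result { int tmp; int tmp2; };

// One run of the experiment from the report: thread1 stores a=1 twice,
// thread2 exchanges in a 2 and then loads a.
Result run_once() {
    a.store(0);
    Result r{};
    std::thread t1([] { a = 1; a = 1; });
    std::thread t2([&r] { r.tmp = a.exchange(2); r.tmp2 = a; });
    t1.join();
    t2.join();
    return r;
}
```

Whatever the interleaving, tmp can only be 0 or 1 and tmp2 can only be 1 or 2; collapsing the two a=1 stores removes only the interleaving that yields tmp=1, tmp2=1, and every result the collapsed version can produce was already a legal outcome before.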
>.<

More generally, collapsing a=1; a=3; into a single store should be ok for
the same reason. A producer thread doing stores separated by StoreStore
barriers to feed a consumer thread doing loads separated by LoadLoad
barriers gives no guarantee that the consumer won't miss some events.

-----------

There are no loads between the stores, so I don't understand having
multiple StoreLoad barriers (mfence), unless that's just a missing
optimization, too.

Are the mfence instructions between each store supposed to protect a signal
handler from something? An interrupt could come in after the first store,
but before the first mfence. (clang uses (lock) xchg for each atomic store
with sequential consistency, which would prevent the possibility of an
interrupt between the store and the mfence.) I guess if the signal handler
sees a=3, it knows that the mfence between a=1 and a=3 has already
happened, but not necessarily the mfence after a=3. If these extra mfences
in a sequence of stores are just for the potential benefit of a signal
handler, doesn't that already happen as part of the context switch to/from
the kernel?

It seems very inefficient that stores to multiple atomic variables produce
multiple mfences. It's worse for ARM, where there's a full memory barrier
before and after every atomic store, so two stores in a row produce two
memory barriers in a row:

    dmb     sy
    movs    r2, #1
    str     r2, [r3]    # a=1
    dmb     sy
    dmb     sy
    str     r2, [r3]    # a=1
    dmb     sy
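For comparison, here is a source-level sketch of the optimization the report is asking for: drop the dead stores, use relaxed stores, and issue a single trailing seq_cst fence instead of one barrier per store. This is my illustration of the idea (the function name simple_set_collapsed is mine), not a claim that it is exactly equivalent to five seq_cst stores under the C++ memory model:

```cpp
#include <atomic>

std::atomic<int> a, c;

// Hand-optimized variant of simple_set from the report: the dead a=1
// stores are dropped, the remaining stores are relaxed, and one full
// barrier is issued at the end instead of an mfence after every store.
void simple_set_collapsed(void) {
    a.store(3, std::memory_order_relaxed);  // a=1; a=1; a=3 collapsed to the final value
    c.store(2, std::memory_order_relaxed);  // c=2
    a.store(3, std::memory_order_relaxed);  // final a=3 (arguably collapsible too)
    std::atomic_thread_fence(std::memory_order_seq_cst);  // single StoreLoad barrier
}
```

On x86 this would compile to three plain movs followed by one mfence, which is the shape of code the report suggests the compiler could emit for the original sequence.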