https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67461

            Bug ID: 67461
           Summary: Multiple atomic stores generate a StoreLoad barrier
                    between each one, not just at the end
           Product: gcc
           Version: 5.2.0
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

Multiple atomic stores in a row generate multiple barriers.

I noticed this while playing around with the same code that led to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67458.  That is a separate issue;
if I've left out any context here, see that bug.

I suspect this is a case of correctness trumping performance, since atomics are
still new.  These cases are probably just missed optimizations for unusual
use-cases, and fixing them is unlikely to benefit real code (unless it
over-uses atomics).



#include <atomic>
std::atomic<int> a, c;
void simple_set(void){ a=1; a=1; a=3; c=2; a=3; }

compiles to (x86, g++ 5.2.0 -O3, on godbolt.org)

        movl    $1, %eax
        movl    %eax, a(%rip)  # a=1
        mfence
        movl    %eax, a(%rip)  # a=1
        movl    $3, %eax
        mfence
        movl    %eax, a(%rip)  # a=3
        mfence
        movl    $2, c(%rip)    # c=2
        mfence
        movl    %eax, a(%rip)  # a=3
        mfence
        ret

First, does the C++ standard actually require emitting multiple stores to the
same variable in a row, when there are no intervening loads or stores in the
source?  I would have thought that at least a=1; a=1; would collapse to a
single store.

Consider
  initially: a=0
  thread1: a=1; a=1;
  thread2: tmp=a.exchange(2); tmp2=a;

These operations have to happen in some order, but isn't the compiler allowed
to make decisions at compile time that eliminate some possible orderings?  e.g.
collapsing both a=1 operations into a single store would make this interleaving
impossible:

  a=1; tmp=a.exchange(2); a=1; tmp2=a; 

But the remaining two orderings are valid, and I think it would be an error for
software to depend on that interleaved ordering being possible.  Does the
standard require generating machine code that can end up with tmp=1, tmp2=1?
If it does, then this isn't a bug.  >.<

More generally, collapsing a=1; a=3;  into a single store should be ok for the
same reason.

 A producer thread doing stores separated by StoreStore barriers, feeding a
consumer thread doing loads separated by LoadLoad barriers, gives no guarantee
that the consumer won't miss some events.

-----------

There are no loads between the stores, so I don't understand having multiple
StoreLoad barriers (mfence), unless that's just a missing optimization, too.

Are the mfence instructions between each store supposed to protect a signal
handler from something?  An interrupt could come in after the first store but
before the first mfence.  (clang uses xchg, which is implicitly locked, for
each sequentially-consistent atomic store; that would rule out an interrupt
landing between the store and the fence.)

 I guess if the signal handler sees a=3, it knows that the mfence between a=1
and a=3 has already happened, but not necessarily the mfence after a=3.

 If these extra mfences in a sequence of stores are just for the potential
benefit of a signal handler, doesn't that barrier already happen as part of the
context switch to/from the kernel?  It seems very inefficient that stores to
multiple atomic variables produce multiple mfences.

 It's worse on ARM, where there's a full memory barrier before and after every
atomic store, so two stores in a row produce two memory barriers in a row:

        dmb     sy
        movs    r2, #1
        str     r2, [r3]   # a=1
        dmb     sy
        dmb     sy
        str     r2, [r3]   # a=1
        dmb     sy
