On 30/01/17 18:54, Torvald Riegel wrote:
This patch fixes the __atomic builtins to not implement supposedly
lock-free atomic loads based on just a compare-and-swap operation.

If there is no hardware-backed atomic load for a certain memory
location, the current implementation can implement the load with a CAS
while claiming that the access is lock-free.  This is a bug in the cases
of volatile atomic loads and atomic loads to read-only-mapped memory; it
also creates a lot of contention in case of concurrent atomic loads,
which results in at least counter-intuitive performance because most
users probably understand "lock-free" to mean hardware-backed (and thus
"fast") instead of just in the progress-criteria sense.

This patch implements option 3b of the choices described here:
https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html


Will Deacon pointed me at this thread asking if something similar could be done on ARM.

On armv8-a we can implement an atomic load of 16 bytes using an LDXP / STXP loop, as a 16-byte load isn't single-copy atomic. On armv8.1-a we do have a CAS on 16 bytes.
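
Roughly along these lines (an illustrative inline-asm sketch of mine, not the backend pattern we would actually add; memory-ordering variants and alignment checks omitted):

#include <stdint.h>

/* 16-byte load on armv8-a via an LDXP/STXP exclusive pair.  The STXP is
   needed to confirm that the LDXP observed a single-copy-atomic value,
   so even a pure load ends up issuing a store -- which is why this
   cannot be used on read-only memory.  Assumes p is 16-byte aligned.  */
static inline void
load_16_armv8a (uint64_t *p, uint64_t out[2])
{
  uint64_t lo, hi;
  uint32_t fail;

  __asm__ volatile ("1: ldxp  %0, %1, [%3]\n\t"
                    "   stxp  %w2, %0, %1, [%3]\n\t"
                    "   cbnz  %w2, 1b"
                    : "=&r" (lo), "=&r" (hi), "=&r" (fail)
                    : "r" (p)
                    : "memory");
  out[0] = lo;
  out[1] = hi;
}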

This was discussed last year here.

https://gcc.gnu.org/ml/gcc/2016-06/msg00017.html

and the consensus seemed to be that we couldn't really do a CAS on read-only data, as we would have no way of avoiding a trap.

I'm sure I'm missing some piece of the puzzle, so I'd be interested in how you avoid this in this implementation on x86_64, where what I can see is still a CAS.
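
To make the read-only concern concrete, here is the kind of (hypothetical) test case I would expect to trap if the load is expanded to a CAS, since config can live in a read-only mapping:

#include <stdio.h>

struct pair { long a, b; };                  /* 16 bytes on LP64 */

static const struct pair config = { 1, 2 };  /* typically placed in .rodata */

int
main (void)
{
  struct pair snapshot;
  /* A pure read in the source; if this is expanded to a compare-and-swap,
     the "load" stores to read-only memory and faults.  */
  __atomic_load (&config, &snapshot, __ATOMIC_SEQ_CST);
  printf ("%ld %ld\n", snapshot.a, snapshot.b);
  return 0;
}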

regards
Ramana



In a nutshell, this does change affected accesses to call libatomic
instead of inlining the CAS-based load emulation -- but libatomic will
continue to do the CAS-based approach.  Thus, there's no change in how
the accesses are actually performed (including compatibility with the
__sync builtins, which are not changed) -- but we do make it much easier
to fix this in the future, and we do ensure that less future code will
have the problematic code inlined into it (and thus unfixable).
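
In other words (my illustration of the intent, not verified compiler output), for a 16-byte load on x86_64 with -mcx16:

__int128 v;

__int128 read_v (void)
{
  /* Before this patch: expanded inline as a lock cmpxchg16b loop.
     After this patch: emitted as a call to libatomic's __atomic_load_16,
     which for now still uses the same CAS-based approach internally,
     but the decision is no longer frozen into user binaries.  */
  return __atomic_load_n (&v, __ATOMIC_SEQ_CST);
}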

Second, the return values of __atomic_always_lock_free and
__atomic_is_lock_free are changed to report false for the affected
accesses.  This aims at making users aware of the fact that they don't
get fast hardware-backed performance for the affected accesses.
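
A small check like the following (my own test, not part of the patch) should now report false on x86_64 with -mcx16, where it previously reported true:

#include <stdbool.h>
#include <stdio.h>

int
main (void)
{
  __int128 x = 0;

  /* Both queries now report the affected 16-byte accesses as not
     lock-free; exact results depend on target and flags.  */
  bool always = __atomic_always_lock_free (sizeof x, 0);
  bool is     = __atomic_is_lock_free (sizeof x, &x);
  printf ("always_lock_free=%d is_lock_free=%d\n", (int) always, (int) is);
  return 0;
}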

This patch does not yet change how OpenMP atomics support is
implemented.  Jakub said he would take care of that.  I suppose one
approach would be to check can_atomic_load_p (introduced by this patch)
to decide in expand_omp_atomic whether to just use the mutex-based
approach; I think that conservatively, OpenMP atomics have to assume
that there could be an atomic load to a particular memory location
elsewhere in the program, so the choice between mutex-backed or not has
to be consistent for a particular memory location.
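
For reference, a hypothetical example (compiled with -fopenmp) of the kind of OpenMP atomic read this decision would apply to:

__int128 shared_val;

__int128 omp_read (void)
{
  __int128 tmp;
  /* expand_omp_atomic could consult can_atomic_load_p for this access
     and, if it fails, fall back to the mutex-based approach --
     consistently for this memory location across the program.  */
#pragma omp atomic read
  tmp = shared_val;
  return tmp;
}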

Thanks to Richard Henderson for providing the libatomic part of this
patch, and thanks to Andrew MacLeod for helping with the compiler parts.

I've tested this on an x86_64-linux bootstrap build and see no
regressions, with the exception of two OpenMP failures, which Jakub
will take care of.  The other failures I saw didn't seem atomics-related
(e.g., openacc); I haven't compared it against a full bootstrap build and
make check of trunk.

AFAICT, in practice only x86 with -mcx16 (or where this flag is
implied) is affected.  The other architectures I looked at seemed to
have patterns for true hardware-backed atomic loads whenever they also
had a wider-than-word-sized CAS available.

I know we missed the stage3 deadline with this, unfortunately.  I think
it would be good to have this fix be part of GCC 7 though, because this
would ensure that less future code has the problematic load-via-CAS code
inlined.

Jakub: If this is OK for GCC 7, can you please take care of the OpenMP
bits and commit this?  Changelog entries are in the commit message.

If others could test on other hardware, this would also be appreciated.

