On 02/02/17 15:21, Torvald Riegel wrote:
On Thu, 2017-02-02 at 14:48 +0000, Ramana Radhakrishnan wrote:
On 30/01/17 18:54, Torvald Riegel wrote:
This patch fixes the __atomic builtins to not implement supposedly
lock-free atomic loads based on just a compare-and-swap operation.

If there is no hardware-backed atomic load for a certain memory
location, the current implementation can implement the load with a CAS
while claiming that the access is lock-free.  This is a bug in the cases
of volatile atomic loads and atomic loads to read-only-mapped memory; it
also creates a lot of contention in case of concurrent atomic loads,
which results in at least counter-intuitive performance because most
users probably understand "lock-free" to mean hardware-backed (and thus
"fast") instead of just in the progress-criteria sense.

This patch implements option 3b of the choices described here:
https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html


Will Deacon pointed me at this thread asking if something similar could
be done on ARM.

It would be nice if someone more familiar with ARM could double-check
that ARM is not affected.  I guess ARM isn't, but that's based on me
looking at machine descriptions, which I hadn't ever done before working
on this patch...


ARM doesn't have __int128 support, so I don't think the problem exists there.

On ARM, for architecture levels (i.e. arch < armv6k) that do not have single-copy atomic instructions, we end up calling the kernel helper routines, where the appropriate handling is done by the kernel depending on whether you are multicore or not.
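
For reference, the relevant helper is the cmpxchg entry in the kuser helper
page; this is a from-memory sketch of its interface (as used by libgcc's
linux-atomic.c), so treat the exact details as an assumption:

  /* __kuser_cmpxchg lives at a fixed address in the vector page.  It
     atomically stores NEWVAL into *PTR iff *PTR == OLDVAL and returns
     zero on success, non-zero otherwise; the kernel picks the right
     implementation for UP vs. SMP.  */
  typedef int (__kernel_cmpxchg_t) (int oldval, int newval, int *ptr);
  #define __kernel_cmpxchg (*(__kernel_cmpxchg_t *) 0xffff0fc0)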

__atomic_load on ARM appears to be OK as well, except for __atomic_load_di, which should really be the ldrexd / strexd loop; we could ameliorate that along the lines of your option 3b.
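
That is, to be atomic the 64-bit load needs something like the following
(the asm in the comment is a from-memory sketch of the expansion, not
verified against trunk):

  #include <stdint.h>

  uint64_t
  load_u64 (const uint64_t *p)
  {
    /* On 32-bit ARM the atomic expansion is roughly

         1: ldrexd  r2, r3, [r0]
            strexd  r1, r2, r3, [r0]
            cmp     r1, #0
            bne     1b

       i.e. the "load" can store, which is exactly the problem the
       patch is about.  */
    return __atomic_load_n (p, __ATOMIC_SEQ_CST);
  }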



On AArch64

* Loads of less than 16 bytes have always been fine. The architecture allows single-copy atomic loads using single load instructions for all of those sizes and memory models, so we are fine there.

* For 16-byte loads we have gone through the libatomic locks from day one of the port. This has been a bit of a bugbear for a number of users within ARM who would really like to get performance without heavyweight locks for 16-byte atomic ops. We could never switch this around on AArch64 to use the loop (or CAS) by default, as that would be the reverse issue (i.e. old compilers calling the libatomic locks and new code using the inlined instruction sequence).
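
So at the source level the 16-byte case today looks like this (a sketch;
__atomic_load_16 is libatomic's entry point as I understand it):

  __int128
  load_u128 (const __int128 *p)
  {
    /* Not inlined on AArch64: this becomes a call to __atomic_load_16
       in libatomic, which currently serializes through a lock.  */
    return __atomic_load_n (p, __ATOMIC_SEQ_CST);
  }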


My interest in this patch was piqued because, if we were to switch aarch64 away from the locks in libatomic and go to a CAS or the loop sequence I showed in my reply to Jakub, I don't believe we would have an ABI issue: I would expect there to be a single copy of libatomic on the system, and everyone would simply be using that.

However, we would then end up generating stores for __atomic_load, as you described above.

We could still in theory do that and gain the same bug for rodata and volatile atomic pointers on AArch64. I don't see a way of avoiding that on aarch32, as we have a similar ABI issue to yours.


The patch introduces a can_atomic_load_p, which checks that an entry
exists in optabs and the machine descriptions for an atomic load of the
particular mode.
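
Roughly, it is along these lines (a simplified sketch of the new check, not
the exact code in the patch; the real function may also allow naturally
atomic word-size loads):

  /* Return true iff the target has a hardware-backed atomic load for
     MODE, i.e. it defines an atomic_load<mode> pattern.  */
  bool
  can_atomic_load_p (machine_mode mode)
  {
    return direct_optab_handler (atomic_load_optab, mode)
           != CODE_FOR_nothing;
  }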

IIRC, arm and aarch64 had atomic load patterns defined whenever they had
a CAS defined.  It would be nice if you could double-check that.  If
that's the case, nothing should change with my patch because
can_atomic_load_p would always return true if a CAS could be issued.
If that's not the case, arm/aarch64 could be in the same boat as x86_64
with cmpxchg16b; then, using Option 3b might be a useful (intermediary)
solution for ARM as well (OTOH, compatibility with existing code and
__sync builtins might perhaps not be as much of a factor as it might be
on x86_64).

On AArch64, IIRC, only instructions that are single-copy atomic as per the architecture are used for atomic loads. Thus we do not generate such loops anywhere.



The atomic load patterns you have could still be wrong, for example by
generating a LDXP/STXP loop or something else that can store on an
atomic load.  In that case, the problem is similar as not having a
custom load pattern, and the second case in the previous paragraph
applies.



On armv8-a we can implement an atomic load of 16 bytes using an LDXP /
STXP loop, as a 16-byte load isn't single-copy atomic. On armv8.1-a we do
have a CAS on 16 bytes.

This was discussed last year here.

https://gcc.gnu.org/ml/gcc/2016-06/msg00017.html

and the consensus seemed to be that we couldn't really do a CAS on
readonly data as we would have no way of avoiding a trap.

Yes, and I agree that not doing a CAS (or anything that can store on
atomic load) is the right thing to do.

BTW, on the ARM targets, is it possible that the __sync builtins would
use LDXP/STXP on 16 bytes and the __atomic loads would fall back to using
libatomic and locks?  Or do you simply not provide 16-byte __sync
builtins?

* Atomic loads always use libatomic and locks for 16 bytes on AArch64.
* There are no __sync primitives provided for 16-byte operations on AArch64, and it's been that way forever. Thus __sync_fetch_and_add on a TImode value will cause a link error. Arguably that's a bug, but no one seems to have actually complained over the last many years.
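
E.g. (assuming the behaviour described above; the exact undefined symbol
name is my guess):

  __int128
  add_u128 (__int128 *p, __int128 v)
  {
    /* Compiles, but fails to link: neither libgcc nor libatomic
       provides a fallback, so the call to __sync_fetch_and_add_16
       stays unresolved.  */
    return __sync_fetch_and_add (p, v);
  }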


I'm sure I'm missing some piece of the puzzle, so I'd be interested
in how you avoid this in this implementation on x86_64, with what, as far
as I can see, is a CAS.

If we'd start from scratch, we wouldn't use cmpxchg16b if we don't have
a 16-byte atomic load.  However, we did do that, and there might be code
out there that does have inlined cmpxchg16b.  As discussed in the thread
you cited, changing GCC 7 back to libatomics *and* the use of locks
would be an effective ABI break.  Therefore, my patch (and Option 3b)
makes a compromise and delegates to libatomic but libatomic gets custom
code to actually keep using cmpxchg16b.  That means that the actual
synchronization doesn't change, but we constrain this problem to
libatomic and prevent application code generated with GCC 7 from being
directly affected.

Does this clarify the x86_64 situation?


Thanks for clarifying the x86_64 situation.

I hope this helps in explaining the ARM situation.

regards
Ramana


