On 02/02/17 15:21, Torvald Riegel wrote:
On Thu, 2017-02-02 at 14:48 +0000, Ramana Radhakrishnan wrote:
On 30/01/17 18:54, Torvald Riegel wrote:
This patch fixes the __atomic builtins to not implement supposedly
lock-free atomic loads based on just a compare-and-swap operation.

If there is no hardware-backed atomic load for a certain memory
location, the current implementation can implement the load with a CAS
while claiming that the access is lock-free.  This is a bug in the cases
of volatile atomic loads and atomic loads to read-only-mapped memory; it
also creates a lot of contention in case of concurrent atomic loads,
which results in at least counter-intuitive performance because most
users probably understand "lock-free" to mean hardware-backed (and thus
"fast") instead of just in the progress-criteria sense.

This patch implements option 3b of the choices described here:
https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html


Will Deacon pointed me at this thread asking if something similar could
be done on ARM.

It would be nice if someone more familiar with ARM could double-check
that ARM is not affected.  I guess ARM isn't, but that's based on me
looking at machine descriptions, which I hadn't ever done before working
on this patch...


ARM doesn't have __int128 support, so I don't think the problem exists there.

On ARM, for architecture levels (i.e. arch < armv6k) that do not have single-copy atomic instructions, we end up calling the kernel helper routines, where the appropriate handling is done by the kernel depending on whether you are multicore or not.
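
For reference, the relevant helper is the cmpxchg entry in the kuser helper
page; this is a from-memory sketch of its interface (as used by libgcc's
linux-atomic.c), so treat the exact details as an assumption:

  /* __kuser_cmpxchg lives at a fixed address in the vector page.  It
     atomically stores NEWVAL into *PTR iff *PTR == OLDVAL and returns
     zero on success, non-zero otherwise; the kernel picks the right
     implementation for UP vs. SMP.  */
  typedef int (__kernel_cmpxchg_t) (int oldval, int newval, int *ptr);
  #define __kernel_cmpxchg (*(__kernel_cmpxchg_t *) 0xffff0fc0)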

__atomic_load on ARM appears to be OK as well, except for __atomic_load_di, which should really be the ldrexd / strexd loop; we could ameliorate that along the lines of your option 3b.
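
That is, to be atomic the 64-bit load needs something like the following
(the asm in the comment is a from-memory sketch of the expansion, not
verified against trunk):

  #include <stdint.h>

  uint64_t
  load_u64 (const uint64_t *p)
  {
    /* On 32-bit ARM the atomic expansion is roughly

         1: ldrexd  r2, r3, [r0]
            strexd  r1, r2, r3, [r0]
            cmp     r1, #0
            bne     1b

       i.e. the "load" can store, which is exactly the problem the
       patch is about.  */
    return __atomic_load_n (p, __ATOMIC_SEQ_CST);
  }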



On AArch64

* Loads of less than 16 bytes have always been fine. The architecture allows single-copy atomic loads using single load instructions for all of those sizes and memory models, so we are fine there.

* For 16-byte loads we have gone through the libatomic locks from day one of the port. This has been a bit of a bugbear for a number of users within ARM who would really like to get performance without heavyweight locks for 16-byte atomic ops. We could never switch this around on AArch64 to use the loop (or CAS) by default, as that would be the reverse issue (i.e. old compilers calling the libatomic locks and new code using the inlined instruction sequence).
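
So at the source level the 16-byte case today looks like this (a sketch;
__atomic_load_16 is libatomic's entry point as I understand it):

  __int128
  load_u128 (const __int128 *p)
  {
    /* Not inlined on AArch64: this becomes a call to __atomic_load_16
       in libatomic, which currently serializes through a lock.  */
    return __atomic_load_n (p, __ATOMIC_SEQ_CST);
  }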


My interest in this patch was piqued because, if we were to switch aarch64 away from the locks in libatomic and go to a CAS or the loop sequence I showed in my reply to Jakub, I don't believe we would have an ABI issue: I would expect there to be a single copy of libatomic on the system, and everyone would simply be using that.

However, we would then end up generating stores for __atomic_load, as you described above.

We could still in theory do that and gain the same bug for rodata and volatile atomic pointers on AArch64. I don't see a way of avoiding that on aarch32, as we have a similar ABI issue to yours.


The patch introduces a can_atomic_load_p, which checks that an entry
exists in optabs and the machine descriptions for an atomic load of the
particular mode.
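
Roughly, it is along these lines (a simplified sketch of the new check, not
the exact code in the patch; the real function may also allow naturally
atomic word-size loads):

  /* Return true iff the target has a hardware-backed atomic load for
     MODE, i.e. it defines an atomic_load<mode> pattern.  */
  bool
  can_atomic_load_p (machine_mode mode)
  {
    return direct_optab_handler (atomic_load_optab, mode)
           != CODE_FOR_nothing;
  }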

IIRC, arm and aarch64 had atomic load patterns defined whenever they had
a CAS defined.  It would be nice if you could double-check that.  If
that's the case, nothing should change with my patch because
can_atomic_load_p would always return true if a CAS could be issued.
If that's not the case, arm/aarch64 could be in the same boat as x86_64
with cmpxchg16b; then, using Option 3b might be a useful (intermediary)
solution for ARM as well (OTOH, compatibility with existing code and
__sync builtins might perhaps not be as much of a factor as it might be
on x86_64).

On AArch64, IIRC, only instructions that are single-copy atomic as per the architecture are used for atomic loads. Thus we do not generate such loops anywhere.



The atomic load patterns you have could still be wrong, for example by
generating a LDXP/STXP loop or something else that can store on an
atomic load.  In that case, the problem is similar as not having a
custom load pattern, and the second case in the previous paragraph
applies.



On armv8-a we can implement an atomic load of 16 bytes using an LDXP /
STXP loop, as a 16-byte load isn't single-copy atomic. On armv8.1-a we do
have a CAS on 16 bytes.

This was discussed last year here.

https://gcc.gnu.org/ml/gcc/2016-06/msg00017.html

and the consensus seemed to be that we couldn't really do a CAS on
readonly data as we would have no way of avoiding a trap.

Yes, and I agree that not doing a CAS (or anything that can store on
atomic load) is the right thing to do.

BTW, on the ARM targets, is it possible that the __sync builtins would
use LDXP/STXP on 16 bytes and the __atomic loads would fall back to using
libatomic and locks?  Or do you simply not provide 16-byte __sync
builtins?

* Atomic loads always use libatomic and locks for 16 bytes on AArch64.
* There are no __sync primitives provided for 16-byte operations on AArch64, and it's been that way forever. Thus __sync_fetch_and_add on a TImode value will cause a link error. Arguably that's a bug, but no one seems to have actually complained over the last many years.
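
E.g. (assuming the behaviour described above; the exact undefined symbol
name is my guess):

  __int128
  add_u128 (__int128 *p, __int128 v)
  {
    /* Compiles, but fails to link: neither libgcc nor libatomic
       provides a fallback, so the call to __sync_fetch_and_add_16
       stays unresolved.  */
    return __sync_fetch_and_add (p, v);
  }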


I'm sure I'm missing some piece of the puzzle, so I'd be interested
in how you avoid this in this implementation on x86_64, with what, as far
as I can see, is a CAS.

If we'd start from scratch, we wouldn't use cmpxchg16b if we don't have
a 16-byte atomic load.  However, we did do that, and there might be code
out there that does have inlined cmpxchg16b.  As discussed in the thread
you cited, changing GCC 7 back to libatomics *and* the use of locks
would be an effective ABI break.  Therefore, my patch (and Option 3b)
makes a compromise and delegates to libatomic but libatomic gets custom
code to actually keep using cmpxchg16b.  That means that the actual
synchronization doesn't change, but we constrain this problem to
libatomic and prevent application code generated with GCC 7 from being
directly affected.

Does this clarify the x86_64 situation?


Thanks for clarifying the x86_64 situation.

I hope this helps in explaining the ARM situation.

regards
Ramana


