https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878
--- Comment #48 from Luke Dalessandro <ldalessandro at gmail dot com> ---
(In reply to LIU Hao from comment #47)
> (In reply to Luke Dalessandro from comment #46)
> > But if 104688 isn't related to this issue, and thus Jakub's comment was in
> > error, I definitely don't understand the underlying problem and why clang is
> > fine doing it.
>
> Issue here is that if atomic load is implemented with a call to libatomic
> routines then it's incorrect to implement CAS without a call.

So my understanding is that 104688 basically determined that it's correct to
implement atomic load with movdqa for aligned addresses on architectures with
AVX support. Hence gcc could inline that in the same way clang does, and
inline cmpxchg16b for compare_exchange/__atomic_compare_exchange{_n} as well,
so there no longer has to be a libatomic call for any of these.

I can accept that -mcx16 is maybe the wrong flag to use to force inlining here
given its cmpxchg-style name, but it really feels like a sophisticated user
who's willing to live in implementation-defined land should be able to get the
same performance for lock-free code out of gcc that they get out of clang in
this situation.
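
For anyone following along, here is a minimal sketch of the kind of code in question. It is an illustration only, not a testcase from this bug: the names `u128`, `shared`, `load16`, `cas16`, `inlines_16_byte_atomics` and the macro `DEMO_USE_16B_ATOMICS` are all made up for the example. The compile-time query needs no library; the 16-byte operations are guarded because, depending on compiler, version and flags, they may become libatomic calls and then need -latomic to link.

```c
#include <assert.h>
#include <stdbool.h>

/* 16-byte value; the movdqa argument only applies to aligned addresses. */
typedef unsigned __int128 u128;
static_assert(sizeof(u128) == 16, "expected a 16-byte type");

/* Compile-time query: true only if the compiler promises to always use
   lock-free instructions for objects of this size at typical alignment.
   No particular answer is assumed here; it varies by compiler and flags. */
bool inlines_16_byte_atomics(void)
{
    return __atomic_always_lock_free(sizeof(u128), 0);
}

#ifdef DEMO_USE_16B_ATOMICS   /* hypothetical macro; may require -latomic */
static _Alignas(16) u128 shared;

/* The load that could be a single aligned 16-byte move when inlined. */
u128 load16(void)
{
    return __atomic_load_n(&shared, __ATOMIC_ACQUIRE);
}

/* The CAS that maps to cmpxchg16b when inlined. Per comment #47, it must
   not be inlined if the load above goes through libatomic. */
bool cas16(u128 *expected, u128 desired)
{
    return __atomic_compare_exchange_n(&shared, expected, desired,
                                       false, /* strong CAS */
                                       __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}
#endif
```

Building with and without -mcx16 under each compiler and inspecting the generated assembly for `load16` and `cas16` (with the macro defined) should show whether each operation is inlined or emitted as a library call.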