https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878
--- Comment #49 from LIU Hao <lh_mouse at 126 dot com> ---
(In reply to Luke Dalessandro from comment #48)
> So my understanding is that 104688 basically determined that it's correct
> to implement atomic load with movdqa for aligned addresses on
> architectures with AVX support. And hence gcc could inline that in the
> same way clang does, and inline cmpxchg16b for
> compare_exchange/__atomic_compare_exchange{_n} as well. And thus there no
> longer has to be a libatomic call for any of these.

Yes. However, I suspect it might be an ABI break.

> I can support the fact that -mcx16 is maybe the wrong flag to use to force
> inlining here given its cmpxchg-style name, but it really feels like a
> sophisticated user that's willing to live in implementation-defined land
> should be able to get the same performance for lock-free code out of gcc
> that it does out of clang in this situation.

May I remind you of https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878#c42?

First, CMPXCHG16B can be much slower than CMPXCHG:
https://quick-bench.com/q/MZioNHkbBn0soH_KSDyYcKmrrxU

Second, not all x86-64 processors support CMPXCHG16B, so `-mcx16` is
required, just as `-mavx` is required for AVX.
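
For reference, here is a minimal sketch (not taken from this bug; names and
values are made up for illustration) of the 16-byte operations under
discussion, written against the GCC __atomic builtins on an unsigned
__int128:

#include <cstdio>

// 16-byte, 16-aligned object that would be shared between threads.
alignas(16) static unsigned __int128 g_value = 1;

int main() {
    // 16-byte atomic load: the candidate for an aligned (V)MOVDQA
    // per PR 104688.
    unsigned __int128 cur = __atomic_load_n(&g_value, __ATOMIC_ACQUIRE);

    // 16-byte compare-exchange: the candidate for an inline
    // LOCK CMPXCHG16B.
    unsigned __int128 desired = cur + 1;
    bool ok = __atomic_compare_exchange_n(&g_value, &cur, desired,
                                          /*weak=*/false,
                                          __ATOMIC_ACQ_REL,
                                          __ATOMIC_ACQUIRE);

    // Print the CAS result and the low 64 bits of the new value.
    std::printf("exchanged=%d low64=%llu\n", (int)ok,
                (unsigned long long)g_value);
    return 0;
}

Depending on the compiler, its version and the flags (e.g. `-mcx16`,
`-mavx`), these operations may be inlined as described above or lowered to
calls into libatomic (in which case `-latomic` is needed at link time);
that difference between gcc and clang is what the comments above are about.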