https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146
Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |peter at cordes dot ca

--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
Created attachment 42125
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42125&action=edit
C11 / pthreads test case for tearing of atomic_llong.  Compile with -m32 -pthread.

This is a real bug.  See the attached C11 test case for atomic_llong tearing of
loads/stores in practice on real x86 hardware with gcc -m32.  (It also compiles
as C++11 to show the difference.)  One thread writes 0 and -1 alternating; the
other thread reads and checks that the value is either 0 or -1.  (Or, with
`double`, 0 and -1.1; or, with a two-member struct, {-1,-1}.)  A rough sketch of
the test appears after the list of options below.

Compile with
  gcc -m32 -march=native -O3 -pthread atomic_llong-underaligned-in-struct.c && ./a.out

offset of x in AtomicStruct = 60.  addr=0x565a80bc.  lockfree() = 1
sizeof(test_type) = 8.  test_type = long long
alignof(AtomicStruct) = 4, alignof(atomic_ttype) = 4, alignof(test_type) = 4
found tearing: tmp = 0x000000ffffffff

If there's a problem, the whole program uses under a millisecond of CPU time,
probably most of it on startup and printing.  (e.g. perf counters show
machine_clears.memory_ordering = 72, so it didn't spend long in the read loop
with the writer going.)

I get the object to cross a cache-line boundary when the type is under-aligned
by using
  struct AtomicStruct { char filler[57]; atomic_ttype x; };
and declaring the object with alignas(64), which puts x at offset 60 so it
spans a 64-byte boundary.

In the x86 32-bit System V ABI, gcc under-aligns an atomic_llong, so it can be
split across the boundary between two cache lines.  That makes loads and stores
non-atomic on every CPU (except a uniprocessor, of course).  Some AMD CPUs are
potentially non-atomic even when an 8B or 16B boundary is crossed
(https://stackoverflow.com/questions/36624881/why-is-integer-assignment-on-a-naturally-aligned-variable-atomic).

Here are some ways to fix this for the i386 System V ABI.  (I think the Windows
32-bit ABI aligns long long and double to 64 bits, but an _Atomic struct still
potentially needs more than its default alignment to be efficiently lock-free.)

1. Change the C stdatomic ABI to match the C++ std::atomic ABI, requiring
   lock-free atomic objects to be naturally aligned.  (But don't change anything
   for non-atomic objects.  The i386 SysV ABI only aligns non-atomic 64-bit
   long long to 4B, and we do need to preserve that.)

2. Use lock cmpxchg8b to implement .load() and .store() instead of SSE or x87
   load or store instructions on all 8-byte objects, like cmpxchg16b is used for
   16-byte objects in x86-64.

3. Make 64-bit objects check alignment before using lock-free sequences,
   otherwise use locking (or lock cmpxchg8b).  The check can be optimized out
   when the compiler can prove that an object is 8B-aligned, e.g. an object with
   static storage, since we get to align it -- unless we're linking with code
   compiled by an old gcc that under-aligned static 64-bit objects.  (A sketch of
   this dispatch follows the list.)

4. Make 64-bit objects never lock-free in the i386 SysV ABI.

5. Option 4 + define a new 32-bit ABI that doesn't suck (pass args in registers,
   use SSE FP, etc., and align 64-bit types to 64 bits).  Not realistic, because
   nobody cares enough about 32-bit code outside of Windows, so the small benefit
   wouldn't justify the pain of having two incompatible ABIs.

Clang is already doing option 1, so gcc and clang currently disagree on struct
layout for structs with a C11 _Atomic member.  Option 1 is by far the best
long-term option for performance and simplicity (not counting option 5).
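For readers who don't want to open the attachment: the test is roughly along
the lines of the sketch below.  This is only a sketch, not the attachment
itself; the file name, iteration count, and exact output are illustrative.  The
idea is the same: alignas(64) on the object so that an under-aligned x lands at
offset 60 and straddles a 64-byte boundary, one thread alternating the two test
values, the other checking it never sees anything else.

/* Rough sketch of the attached test (not the attachment itself).
   gcc -m32 -march=native -O3 -pthread tearing-sketch.c && ./a.out */
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

typedef long long test_type;
typedef _Atomic test_type atomic_ttype;

struct AtomicStruct { char filler[57]; atomic_ttype x; };

/* alignas(64) on the object: if atomic_ttype is under-aligned to 4B,
   x ends up at offset 60 and straddles a 64-byte cache-line boundary. */
static _Alignas(64) struct AtomicStruct s;

static void *writer(void *arg)      /* alternates the two test values forever */
{
    (void)arg;
    for (;;) {
        atomic_store_explicit(&s.x, 0, memory_order_release);
        atomic_store_explicit(&s.x, -1, memory_order_release);
    }
    return NULL;
}

int main(void)
{
    printf("offset of x in AtomicStruct = %zu.  lockfree() = %d\n",
           offsetof(struct AtomicStruct, x), atomic_is_lock_free(&s.x));
    printf("alignof(AtomicStruct) = %zu, alignof(atomic_ttype) = %zu\n",
           _Alignof(struct AtomicStruct), _Alignof(atomic_ttype));

    pthread_t t;
    pthread_create(&t, NULL, writer, NULL);

    for (long i = 0; i < 100000000L; i++) {  /* reader: must only see 0 or -1 */
        test_type tmp = atomic_load_explicit(&s.x, memory_order_acquire);
        if (tmp != 0 && tmp != -1) {
            printf("found tearing: tmp = 0x%016llx\n", (unsigned long long)tmp);
            exit(1);
        }
    }
    puts("no tearing observed");
    return 0;
}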
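And to make option 3 (plus the lock cmpxchg8b trick from option 2) concrete,
the runtime dispatch for a 64-bit load could look something like the sketch
below.  This only illustrates the idea, not how gcc/libatomic is actually
structured; load_8_checked is a name made up for this example.

#include <stdint.h>

/* Sketch of option 3: check alignment at runtime, and use the cheap
   lock-free sequence only when the 8-byte object cannot straddle a
   cache-line boundary.  The misaligned path uses option 2's trick:
   emulating a pure load with a compare-exchange (lock cmpxchg8b),
   which is atomic regardless of alignment, just very slow. */
uint64_t load_8_checked(uint64_t *p)
{
    if (((uintptr_t)p & 7) == 0) {
        /* 8-byte aligned: a single wide load (SSE movq / x87 fild) is atomic. */
        return __atomic_load_n(p, __ATOMIC_SEQ_CST);
    }

    /* Possibly split across a boundary: dummy CAS of 0 -> 0.
       If *p was 0, it "succeeds" and stores 0 back (no visible change);
       otherwise it fails and copies the current value into expected.
       Either way, expected ends up holding the old value, read atomically. */
    uint64_t expected = 0;
    __atomic_compare_exchange_n(p, &expected, 0, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
}

For objects the compiler can prove are 8B-aligned (e.g. static storage it gets
to align itself), the branch folds away entirely, which is the point of the
"optimized out" remark in option 3.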
Option 2 will work, but it is always horrible for performance for pure loads
and non-seq-cst stores, even with aligned objects.

lock cmpxchg8b is atomic even when it crosses a cache-line or page boundary
(like all x86 atomic RMW operations using the lock prefix), or any other
boundary that regular loads/stores could not cross atomically.  This is
*catastrophic* for performance, though, because instead of just internally
locking a line of L1D cache (by delaying responses to invalidate requests for
its copy), the CPU has to make sure the change to both cache lines propagates
all the way to memory, I think.  (x86 locked instructions are atomic with
respect to I/O and DMA observers, not just other CPUs, so it can't just keep
both cache lines locked.)

On my Skylake i7-6700k, it's literally a 132x slowdown for a single thread
doing `lock add`, aligned vs. crossing a cache-line boundary.  These penalties
would also happen by chance for more 8B objects on AMD hardware if crossing a
16B or 32B boundary really is non-atomic for regular loads/stores there,
instead of the penalty existing only at 64B boundaries.

00000000004000e0 <_start.loop>:
  4000e0:   f0 48 83 47 3f 01            lock add QWORD PTR [rdi+0x3f],0x1   ## rdi is page-aligned
  4000e6:   f0 48 83 47 7f 01            lock add QWORD PTR [rdi+0x7f],0x1
  4000ec:   f0 48 83 87 bf 00 00 00 01   lock add QWORD PTR [rdi+0xbf],0x1
  4000f5:   f0 48 83 87 ff 00 00 00 01   lock add QWORD PTR [rdi+0xff],0x1
  4000fe:   ff cd                        dec    ebp
  400100:   7f de                        jg     4000e0 <_start.loop>

This runs in 29221 ms (113.961G clock cycles) for 10M loop iterations: ~1119
cycles per instruction (including the loop branch).  The equivalent loop with
offsets of +0x40 etc. (so every object sits within a single cache line) ran in
220 ms (0.8605G clock cycles).

The equivalent loop with movq [rdi+0x40], xmm0 runs in 10 ms (0.0402G cycles),
and half that for loads.  (That doesn't count the time to unpack back to
integer registers for long long, but it is a fair comparison for double.)  So
that's another factor of ~20 for a pure load / pure store vs. an aligned lock
add (which costs about the same as lock cmpxchg8b).

Thus the worst-case penalty (misaligned lock cmpxchg8b emulation of load/store
vs. an aligned SSE2 simple load or release store) is about a factor of 3000.

------

Something has regressed since this bug was filed, and now alignof(atomic_llong)
is only 4, so this is a potential problem even outside of structs.  The same
goes for _Atomic double and _Atomic struct { int a,b; }: these 8-byte objects
need 8B alignment to work correctly as lock-free objects.

C++11 std::atomic previously had the same bug
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65147), filed at the same time,
but it got fixed correctly, so lock-free std::atomic<> objects always have
natural alignment.
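For anyone who wants to check a given gcc for this regression, something like
the following trivial sketch (compile with -m32; "struct pair" is just a name
for this example) prints the alignments directly.  With natural alignment for
lock-free atomics, as with the C++11 fix, all three should report 8.

#include <stdatomic.h>
#include <stdio.h>

struct pair { int a, b; };

int main(void)
{
    /* With a fixed ABI (natural alignment for lock-free atomic objects),
       all three of these should print 8 even with -m32. */
    printf("alignof(atomic_llong)        = %zu\n", _Alignof(atomic_llong));
    printf("alignof(_Atomic double)      = %zu\n", _Alignof(_Atomic double));
    printf("alignof(_Atomic struct pair) = %zu\n", _Alignof(_Atomic(struct pair)));

    atomic_llong x = 0;
    printf("atomic_is_lock_free(&x)      = %d\n", (int)atomic_is_lock_free(&x));
    return 0;
}

A _Static_assert(_Alignof(atomic_llong) >= 8, "under-aligned atomic") would
catch the same thing at compile time.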