https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95750
            Bug ID: 95750
           Summary: [x86] Use dummy atomic insn instead of mfence in __atomic_thread_fence(seq_cst)
           Product: gcc
           Version: 10.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: andysem at mail dot ru
  Target Milestone: ---

Currently, __atomic_thread_fence(seq_cst) on x86 and x86-64 generates an mfence instruction. A dummy atomic instruction (a lock-prefixed instruction or xchg with a memory operand) would provide the same sequential consistency guarantees while being more efficient on most current CPUs. The mfence instruction additionally orders non-temporal stores, but that is not relevant for atomic operations, and non-temporal stores are not ordered by seq_cst atomic operations anyway.

Regarding performance, some data is available in Agner Fog's instruction tables:

https://www.agner.org/optimize/

Also, there is this article:

https://shipilev.net/blog/2014/on-the-fence-with-dependencies/

TL;DR: There is a benefit on every CPU except Atom; on Atom there is no difference.

Regarding the dummy instruction and target memory location, here are some considerations:

- The lock-prefixed instruction should preferably not alter flags or registers and should require a minimum number of registers.
- The memory location should not be shared with other threads.
- The memory location should be likely to be in cache.
- The memory location should not alias existing data on the stack, so that we don't introduce a false data dependency on previous or subsequent instructions.

Based on the above, a good candidate is "lock not" on a dummy variable on the top of the stack. Note that the variable would be accessible through esp/rsp, is likely to be in hot memory, and is likely to be thread-private.

I've implemented this optimization in Boost.Atomic, and a similar optimization is done in MSVC:

https://github.com/microsoft/STL/pull/740
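
For illustration, here is a minimal sketch of what such a fence could look like when written by hand with GCC extended inline assembly. It is modelled on the considerations listed above and is not the code GCC would emit, nor a verbatim copy of the Boost.Atomic implementation. The names seq_cst_fence_sketch, flag_a, flag_b and store_then_load are hypothetical and exist only for this example. The asm assumes the default AT&T assembler syntax; on non-x86 targets it falls back to the standard fence.

```cpp
#include <atomic>

// Hypothetical helper: a seq_cst thread fence built from a dummy
// lock-prefixed instruction instead of mfence. Illustration only.
inline void seq_cst_fence_sketch() noexcept
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    // A thread-private byte in the current stack frame: not shared with
    // other threads and likely to be in cache. It is initialized only to
    // keep memory checkers quiet; the value itself is never used.
    unsigned char dummy = 0u;

    // "not" does not modify the flags and needs no register operand.
    // The "memory" clobber also makes this a compiler-level barrier.
    __asm__ __volatile__("lock; notb %0" : "+m"(dummy) : : "memory");
#else
    // Other targets: use the standard fence.
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}

// Hypothetical usage: the store-then-load pattern (as in Dekker's
// algorithm) that needs a full fence between the store and the load.
inline std::atomic<int> flag_a{0};
inline std::atomic<int> flag_b{0};

inline int store_then_load() noexcept
{
    flag_a.store(1, std::memory_order_relaxed);
    seq_cst_fence_sketch();  // store to flag_a must be visible before the load
    return flag_b.load(std::memory_order_relaxed);
}
```

This requires C++17 for the inline variables. A hand-written asm fence sits outside the C++ memory model, so its interaction with std::atomic operations here relies on the x86 hardware guarantees plus the "memory" clobber rather than on anything the language standard promises; doing this inside the compiler, as the report requests, avoids that caveat.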