[Bug target/95750] New: [x86] Use dummy atomic insn instead of mfence in __atomic_thread_fence(seq_cst)

andysem at mail dot ru Thu, 18 Jun 2020 12:33:15 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95750


            Bug ID: 95750
           Summary: [x86] Use dummy atomic insn instead of mfence in
                    __atomic_thread_fence(seq_cst)
           Product: gcc
           Version: 10.1.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: andysem at mail dot ru
  Target Milestone: ---

Currently, __atomic_thread_fence(seq_cst) on x86 and x86-64 generates mfence
instruction. A dummy atomic instruction (a lock-prefixed instruction or xchg
with a memory operand) would provide the same sequential consistency guarantees
while being more efficient on most current CPUs. The mfence instruction
additionally orders non-temporal stores, which is not relevant for atomic
operations and are not ordered by seq_cst atomic operations anyway.

Regarding performance, some data is available in Agner Fog's instruction
tables:

https://www.agner.org/optimize/

Also, there is this article:

https://shipilev.net/blog/2014/on-the-fence-with-dependencies/

TL;DR: There is benefit on every CPU except Atom; on Atom there is no
difference.

Regarding the dummy instruction and target memory location, here are some
considerations:

- The lock-prefixed instruction should preferably not alter flags or registers
and should require minimum number of registers.
- The memory location should not be shared with other threads.
- The memory location should likely be in cache.
- The memory location should not alias existing data on the stack, so that we
don't introduce a false data dependency on previous or subsequent instructions.

Based on the above, a good candidate is "lock not" on a dummy variable on the
top of the stack. Note that the variable would be accessible through esp/rsp,
it is likely to be in hot memory and is likely thread-private.

I've implemented this optimization in Boost.Atomic, and a similar optimization
is done in MSVC:

https://github.com/microsoft/STL/pull/740

[Bug target/95750] New: [x86] Use dummy atomic insn instead of mfence in __atomic_thread_fence(seq_cst)

Reply via email to