https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86352

            Bug ID: 86352
           Summary: setc/movzx introduced into loop to provide a constant
                    0 value for a later rep stos
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

The wrong-code bug 86314 also revealed some very weird code-gen decisions,
which the fix didn't improve.

(I think the lock bts peephole is seen pretty late, and that's one necessary
factor for this problem.  But even without it, an unnecessary data dependency
between the lock bts loop and clearing memory is silly.)

This ended up being about 5 separate bugs, but IDK which belong together or are
already reported:

* useless mov %rsi, %rcx and useless mov %rdx, %rdi
* using setc/movzx instead of xor %eax,%eax to get a constant 0; slower, and
it creates a data dependency
* doing that inside the loop instead of after it
* not adjusting register allocation to allow xor / set-flags / setc
* choosing rep stos for zeroing, but vector stores for any other repeated
value



The reproducer test-case for bug 86314 loops until it finds and claims a zero
bit in a uint64_t, then returns a Bucket() object (whose constructor
zero-initializes it) with no data dependency on anything.
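The shape of that reproducer can be sketched like this (a minimal sketch; all names here are assumptions for illustration, the real test case is attached to bug 86314):

```cpp
#include <atomic>
#include <cstdint>

// Bucket's constructor zero-initializes 16 uint64_t: the 128 bytes
// that "rep stosq" clears in the generated code.
struct Bucket {
    uint64_t data[16] = {};
};

struct BucketMap {
    std::atomic<uint64_t> inUse{0};

    Bucket acquireBucket() {
        for (;;) {
            // The simplified source only probes bit position 0 or 1,
            // matching the "andl $1" in the generated asm.
            uint64_t pos = inUse.load(std::memory_order_relaxed) & 1;
            uint64_t mask = uint64_t(1) << pos;
            // fetch_or of a single bit, testing only that bit's old
            // value, is the pattern GCC peepholes into "lock bts".
            if (!(inUse.fetch_or(mask) & mask))
                return Bucket();  // claimed a previously-zero bit
        }
    }
};
```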

But gcc decides to introduce a flag -> integer 0/1 conversion inside the
acquire() loop instead of just using  xor eax,eax  before rep stosq.  The loop
can only exit when CF = 0, so RAX = 0 on exit; it's not a correctness problem.

The loop is branching on CF as set by BTS, so there's no need to have the 0/1
in a register at all inside the loop, and setc/movzx from a known-zero CF is
more expensive than xor-zeroing.  (Plus it gives the STOSQ a data dependency on
the LOCK BTS flag result which it wouldn't have otherwise.  The stores can't
commit until after the lock memory barrier, but they can execute.)

This is the actual code-gen from (GCC-Explorer-Build) 9.0.0 20180627
https://godbolt.org/g/XGF5tR


BucketMap::acquireBucket():
        movq    %rdi, %rdx
        movq    %rsi, %rcx      # useless, lock bts can use (%rsi)
.L2:
        movq    (%rsi), %rax
        andl    $1, %eax        # source is simplified to only check positions 0 or 1
        lock btsq       %rax, (%rcx)  # Why not (%rsi)?
        setc    %al
        movzbl  %al, %eax       # xor / bts / setc would have been possible with a different reg
        jc      .L2
        # rax = 0 because the loop can only exit when CF=0

        # should use  xor %eax,%eax  here instead

        movq    %rdx, %rdi      # Useless, RDI still == RDX
        movl    $16, %ecx
        rep stosq
        movq    %rdx, %rax      # can't be done before rep stosq: RAX needs to be 0
        ret



With -m32, where 64-bit lock bts isn't available, the loop uses lock cmpxchg8b,
ending with an OR.  So there is a zero in an integer register from that, but
it's not in EAX, so the code-gen includes an extra `mov %esi, %eax`, which is
no cheaper than xor %eax,%eax, especially with -march=haswell.
Sandybridge-family has xor-zeroing as cheap as a NOP, but mov-elimination isn't
always perfect, and SnB itself doesn't have it.

And of course mov still has a data dependency on the source of the zero, so it
defeats the effect of branch prediction + speculation breaking the (control)
dependency.  This last point applies to any out-of-order x86.

I guess the lock bts peephole is seen too late to notice that it can't recycle
the 0 from the loop condition anymore, and ends up generating code to
materialize it.  But why inside the loop?
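For reference, the single-bit pattern that the peephole matches can be reduced to something like this (a hypothetical minimal function, not from the reproducer):

```cpp
#include <atomic>
#include <cstdint>

// fetch_or of one bit where only that bit's old value is tested:
// GCC can compile the RMW + test down to "lock bts" plus a branch on
// CF, with no 0/1 ever materialized in a register -- which is why the
// setc/movzx inside the loop above is unnecessary.
bool claim_bit0(std::atomic<uint64_t> &m) {
    return (m.fetch_or(1) & 1) == 0;  // true if bit 0 was previously clear
}
```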

------


Even if we *did* need an integer 0/1 in a register inside the loop, we could
still use the xor / set-flags / setcc optimization: Simply use a register other
than RAX for the load / AND $1 / bts source.  And you can hoist the xor-zeroing
out of the loop.


        xor     %eax, %eax
.L2:
        movq    (%rsi), %rcx
        andl    $1, %ecx
        lock btsq       %rcx, (%rsi)
        setc    %al             # only writes AL; upper bits stay 0 from the xor
        # use %rax here if the 0/1 is needed inside the loop
        jc      .L2


---

Separately:

If the initializer is non-zero, it uses SSE or AVX stores.  That makes no sense
either: if rep stosq is optimal, use  mov eax, 1  for the all-ones (qword = 1)
case.  (See the ifdef in the Godbolt link to try it.)

If it's not optimal, use xorps xmm0,xmm0 to create an all-zero vector.

I guess gcc is checking for all-zeros as a common special case, but doesn't
check for repeats of any other value, except for repeated bytes recognized as
memset.

So it makes sense that gcc uses a different strategy, but I think that for only
16x 8 bytes (128 bytes), vector stores beat rep stos on current CPUs.  (That
may change when IceLake introduces fast short-rep stos/movs.)
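For scale, the store in question is only this (a sketch with assumed names; the struct size matches the 16-qword rep stosq in the asm above):

```cpp
#include <cstdint>
#include <cstring>

struct Bucket { uint64_t data[16]; };  // 16 x 8 = 128 bytes

// A compiler may implement this as "rep stosq" or as four 32-byte
// vector stores; for a known 128-byte size, vector stores are the
// expected win on current (pre-fast-short-rep) CPUs.
void zeroBucket(Bucket *b) {
    std::memset(b, 0, sizeof(*b));
}
```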

GCC does notice that it can reuse the same vector for every store, but not that
it could have loaded it with vpbroadcastq from an 8-byte constant instead of a
32-byte constant (with -march=haswell for AVX2).  Broadcast-loads are no more
expensive than regular vector loads, but use less space for constants.
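The repeated-value case reduces to a fill loop like this (a sketch; function name is made up, and whether the constant comes from a broadcast load depends on the compiler and -march):

```cpp
#include <cstddef>
#include <cstdint>

// Fill 16 qwords with the same non-zero value.  When this vectorizes
// (e.g. -O2 -march=haswell), the vector of repeated values only needs
// an 8-byte constant plus a vpbroadcastq-style broadcast load, not a
// full 32-byte constant in .rodata.
void fillBucket(uint64_t *p, uint64_t v) {
    for (std::size_t i = 0; i < 16; ++i)
        p[i] = v;
}
```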
