https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86352
Bug ID: 86352
Summary: setc/movzx introduced into loop to provide a constant 0 value for a later rep stos
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

The wrong-code bug 86314 also revealed some very weird code-gen decisions, which the fix didn't improve. (I think the lock bts peephole runs pretty late, and that's one necessary factor for this problem. But even without it, an unnecessary data dependency between the lock bts loop and clearing memory is silly.)

This ended up being about 5 separate bugs, but I don't know which belong together or are already reported:

* useless mov %rsi, %rcx and useless mov %rdx, %rdi
* using setc/movzx instead of xor %eax,%eax to get a constant 0; slower, and it creates a data dependency
* doing that inside the loop instead of after it
* not adjusting register allocation to allow xor / set-flags / setc
* rep stos vs. vector stores as a strategy for zeroing vs. for any other repeated value

The reproducer test-case for bug 86314 loops until it finds and claims a zero bit in a uint64_t, then returns a Bucket() object (with a constructor that zero-initializes it) with no data dependency on anything. But gcc decides to materialize a flag -> integer 0/1 inside the acquire() loop instead of just using xor %eax,%eax before the rep stosq. The loop can only exit when CF = 0, so RAX = 0, so it's not a correctness problem. But the loop branches on CF as set by BTS, so there's no need to have the 0/1 in a register at all inside the loop, and setc/movzx from a known-zero CF is more expensive than xor-zeroing. (Plus it gives the STOSQ a data dependency on the LOCK BTS flag result which it wouldn't otherwise have. The stores can't commit until after the lock memory barrier, but they could execute earlier.)
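For context, here's a minimal sketch of what the bug 86314 reproducer looks like. This is a hypothetical reconstruction (the struct names and field name `bits` are guesses; the real testcase is attached to bug 86314): a zero-initialized 128-byte Bucket returned by value, and an atomic fetch_or whose single-bit result test is what GCC peepholes into lock bts.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical reconstruction of the bug 86314 reproducer shape.
struct Bucket {
    uint64_t data[16] = {};   // 16 qwords -> the "movl $16, %ecx; rep stosq"
};

struct BucketMap {
    std::atomic<uint64_t> bits{0};

    Bucket acquireBucket() {
        for (;;) {
            // Simplified source only considers bit positions 0 or 1.
            uint64_t pos = bits.load(std::memory_order_relaxed) & 1;
            // Atomic RMW; testing only bit `pos` of the old value lets
            // GCC use lock bts and branch on CF.
            uint64_t old = bits.fetch_or(uint64_t(1) << pos);
            if (!((old >> pos) & 1))
                return Bucket{};  // zero-init: no data dependency on the loop
        }
    }
};
```

The point of the returned `Bucket{}` is that its contents have no data dependency on the loop, so the zeroing should be a fresh xor %eax,%eax, not a recycled flag result.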
This is the actual code-gen from (GCC-Explorer-Build) 9.0.0 20180627: https://godbolt.org/g/XGF5tR

BucketMap::acquireBucket():
        movq    %rdi, %rdx
        movq    %rsi, %rcx          # useless: lock bts can use (%rsi)
.L2:
        movq    (%rsi), %rax
        andl    $1, %eax            # source is simplified to only check positions 0 or 1
        lock btsq %rax, (%rcx)      # why not (%rsi)?
        setc    %al
        movzbl  %al, %eax           # xor / bts / setc would have been possible with a different reg
        jc      .L2
        # rax = 0 here because the loop can only exit when CF=0;
        # should use xor %eax,%eax instead
        movq    %rdx, %rdi          # useless: RDI still == RDX
        movl    $16, %ecx
        rep stosq
        movq    %rdx, %rax          # can't be done before rep stosq: RAX needs to be 0
        ret

With -m32, where 64-bit lock bts isn't available, we get a lock cmpxchg8b loop ending with an OR. That does leave a zero in an integer register, but it's not in EAX, so the code-gen includes an extra `mov %esi, %eax`, which is not cheaper than xor %eax,%eax, especially with -march=haswell. Sandybridge-family makes xor-zeroing as cheap as a NOP, but mov-elimination isn't always perfect, and SnB itself doesn't have it. And of course mov still has a data dependency on the register holding the zero, so it defeats the way branch prediction + speculative execution break control dependencies. This last point applies to any out-of-order x86.

I guess the lock bts peephole runs too late to notice that it can't recycle the 0 from the loop condition anymore, and ends up generating code to materialize it. But why inside the loop?

------

Even if we *did* need an integer 0/1 in a register inside the loop, we could still use the xor / set-flags / setcc optimization: simply use a register other than RAX for the load / and $1 / bts source, and hoist the xor-zeroing out of the loop:

        xor     %eax, %eax
.L2:
        movq    (%rsi), %rcx
        andl    $1, %ecx
        lock btsq %rcx, (%rsi)
        setc    %al                 # into the xor-zeroed %rax: upper bits stay 0
        jc      .L2

---

Separately: if the initializer is non-zero, gcc uses SSE or AVX stores instead. That makes no sense either: if rep stosq is optimal, use mov eax, 1 for the all-ones case.
(See the ifdef in the Godbolt link to try it.) And if rep stosq is not optimal, then the zero case should use xorps xmm0,xmm0 to create an all-zero vector instead. I guess gcc checks for all-zeros as a common special case, but doesn't check for repeats of any other value, except for repeated bytes recognized as memset. So it makes sense that gcc uses a different strategy for non-zero values, but I think for only 16x 8 bytes (128 bytes), vector stores beat rep stos on current CPUs. (That may change when IceLake introduces fast short-rep stos/movs.)

GCC does notice that it can reuse the same vector repeatedly, but not that it could have loaded it with vpbroadcastq from an 8-byte constant instead of from a 32-byte constant (with -march=haswell for AVX2). Broadcast loads are no more expensive than regular vector loads, but take less space for constants.
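The non-zero-initializer variant discussed above can be reproduced with a sketch like the following (hypothetical; it mirrors the ifdef'd case in the Godbolt link, with guessed names). Every qword gets the same non-zero value, so either rep stosq with RAX=1 or a vpbroadcastq-loaded vector would work; GCC instead stores from a full-width vector constant.

```cpp
#include <cstdint>

struct Bucket {
    uint64_t data[16];
    // Repeated non-zero value: GCC emits SSE/AVX stores from a full-width
    // vector constant here, rather than rep stosq (with mov eax, 1) or a
    // vpbroadcastq load of an 8-byte constant.
    Bucket() {
        for (uint64_t &q : data) q = 1;
    }
};
```

Since the value repeats every 8 bytes (not every byte), it isn't recognized as a memset, which is why the all-zeros special case and the repeated-byte case take different paths from this one.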