https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82666
Bug ID: 82666
Summary: [7/8 regression]: sum += (x>128 ? x : 0) puts the cmov on the critical path (at -O2)
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

long long sumarray(const int *data)
{
    data = (const int*)__builtin_assume_aligned(data, 64);
    long long sum = 0;
    for (int c = 0; c < 32768; c++)
        sum += (data[c] >= 128 ? data[c] : 0);
    return sum;
}

The loop body is written to encourage gcc to make the loop-carried dep chain just an ADD, with independent branchless zeroing of each input. But unfortunately, gcc7 and gcc8 -O2 de-optimize it back to what we get from older gcc -O3 for

    if (data[c] >= 128)      // doesn't auto-vectorize with gcc4, unlike the above
        sum += data[c];

See also
https://stackoverflow.com/questions/28875325/gcc-optimization-flag-o3-makes-code-slower-then-o2

https://godbolt.org/g/GgVp7E

gcc 8.0.0 20171022 -O2 -mtune=haswell (slow):

        leaq    131072(%rdi), %rsi
        xorl    %eax, %eax
.L3:
        movslq  (%rdi), %rdx
        movq    %rdx, %rcx
        addq    %rax, %rdx        # mov+add could have been a single LEA
        cmpl    $127, %ecx
        cmovg   %rdx, %rax        # sum = (x >= 128 ? sum+x : sum)
        addq    $4, %rdi
        cmpq    %rsi, %rdi
        jne     .L3
        ret

This version has a 3-cycle-latency loop-carried dep chain: addq %rax, %rdx plus the cmov. It's also 8 fused-domain uops (1 more than older gcc), but using LEA instead of mov+add would fix that.

gcc 6.3 -O2 -mtune=haswell (the last good version of gcc on Godbolt for this test):

        leaq    131072(%rdi), %rsi
        xorl    %eax, %eax
        xorl    %ecx, %ecx        # extra zeroed register as a cmov source
.L3:
        movslq  (%rdi), %rdx
        cmpl    $127, %edx
        cmovle  %rcx, %rdx        # rdx = 0 when x < 128
        addq    $4, %rdi
        addq    %rdx, %rax        # sum += ...: the critical path, 1c latency
        cmpq    %rsi, %rdi
        jne     .L3
        ret

7 fused-domain uops in the loop (cmov is 2 uops with 2c latency before Broadwell). It should run at 1.75 cycles per iteration on Haswell (or slightly slower because of the odd number of uops in the loop buffer), bottlenecked on the front-end. The latency bottleneck is only 1 cycle per iteration (a bound that Ryzen might come closer to hitting).

Anyway, on Haswell (with -mtune=haswell) the function should be about 1.7x slower with gcc7/8 than with gcc6 and earlier: 3c latency-bound vs. 1.75c front-end-bound.

Moreover, gcc should try to optimize something like this:

    if (data[c] >= 128)
        sum += data[c];

into conditionally zeroing a register, instead of putting the cmov on the loop-carried dep chain.
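
To spell out the transformation being requested, here is a source-level sketch (the sum_if/sum_zeroed names and the tmp temporary are illustrative only, not gcc's internal representation):

    // What gcc7/8 effectively do: if-convert so the cmov selects between
    // sum and sum+x, putting the 2c cmov on the loop-carried dep chain.
    long long sum_if(const int *data)
    {
        long long sum = 0;
        for (int c = 0; c < 32768; c++)
            if (data[c] >= 128)
                sum += data[c];
        return sum;
    }

    // Equivalent form gcc could rewrite it to: the compare+cmov zeroes an
    // independent per-iteration temporary, so the only loop-carried
    // operation is the 1c add (this is what gcc6 emits for the ternary
    // source above).
    long long sum_zeroed(const int *data)
    {
        long long sum = 0;
        for (int c = 0; c < 32768; c++) {
            int tmp = (data[c] >= 128) ? data[c] : 0;
            sum += tmp;
        }
        return sum;
    }

Both forms compute the same result for all inputs (tmp is an int, sign-extended on the add, exactly like data[c] in the if-version), which is what makes the rewrite safe.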
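
For anyone who wants to reproduce the slowdown, a minimal timing-harness sketch (assumptions: Linux/glibc with C11 for aligned_alloc and clock_gettime; REPS and the rand()%256 input distribution are arbitrary choices; sumarray is the function from this report, compiled in a separate translation unit with each compiler under test):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    long long sumarray(const int *data);   // from the report, separate TU

    int main(void)
    {
        enum { N = 32768, REPS = 100000 };
        int *data = aligned_alloc(64, N * sizeof(int));  // 64B-aligned, as the
        for (int i = 0; i < N; i++)                      // function assumes
            data[i] = rand() % 256;        // ~half the elements are >= 128

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        long long sink = 0;                // keep calls from being optimized out
        for (int r = 0; r < REPS; r++)
            sink += sumarray(data);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("%.3f ns/element (sink=%lld)\n",
               sec * 1e9 / ((double)N * REPS), sink);
        free(data);
        return 0;
    }

At e.g. 4GHz a cycle is 0.25ns, so the 3c-vs-1.75c difference should show up directly in the ns/element figure.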