https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82666
Bug ID: 82666
Summary: [7/8 regression]: sum += (x>128 ? x : 0) puts the cmov on the critical path (at -O2)
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

long long sumarray(const int *data)
{
    data = (const int*)__builtin_assume_aligned(data, 64);
    long long sum = 0;
    for (int c = 0; c < 32768; c++)
        sum += (data[c] >= 128 ? data[c] : 0);
    return sum;
}

The loop body is written to encourage gcc to make the loop-carried dep chain just an ADD, with independent branchless zeroing of each input. But unfortunately, gcc7 and gcc8 -O2 de-optimize it back to what we get from older gcc -O3 for

    if (data[c] >= 128)      // doesn't auto-vectorize with gcc4, unlike the above
        sum += data[c];

See also
https://stackoverflow.com/questions/28875325/gcc-optimization-flag-o3-makes-code-slower-then-o2

https://godbolt.org/g/GgVp7E

gcc 8.0.0 20171022 -O2 -mtune=haswell (slow):

        leaq    131072(%rdi), %rsi
        xorl    %eax, %eax
.L3:
        movslq  (%rdi), %rdx
        movq    %rdx, %rcx
        addq    %rax, %rdx        # mov+add could have been a single LEA
        cmpl    $127, %ecx
        cmovg   %rdx, %rax        # sum = (x >= 128 ? sum+x : sum)
        addq    $4, %rdi
        cmpq    %rsi, %rdi
        jne     .L3
        ret

This version has a 3-cycle-latency loop-carried dep chain: addq %rax, %rdx plus the cmov. It's also 8 fused-domain uops (1 more than older gcc), but using LEA instead of mov+add would fix that.

gcc 6.3 -O2 -mtune=haswell (the last good version of gcc on Godbolt for this test):

        leaq    131072(%rdi), %rsi
        xorl    %eax, %eax
        xorl    %ecx, %ecx        # extra zeroed register as a cmov source
.L3:
        movslq  (%rdi), %rdx
        cmpl    $127, %edx
        cmovle  %rcx, %rdx        # rdx = 0 when x < 128
        addq    $4, %rdi
        addq    %rdx, %rax        # sum += ...: the critical path, 1c latency
        cmpq    %rsi, %rdi
        jne     .L3
        ret

7 fused-domain uops in the loop (cmov is 2 uops with 2c latency before Broadwell). It should run at 1.75 cycles per iteration on Haswell (or slightly slower because of the odd number of uops in the loop buffer), bottlenecked on the front-end. The latency bottleneck is only 1 cycle per iteration (a bound that Ryzen might come closer to hitting).

Anyway, on Haswell (with -mtune=haswell) the function should be about 1.7x slower with gcc7/8 than with gcc6 and earlier: 3c latency-bound vs. 1.75c front-end-bound.

Moreover, gcc should try to optimize something like this:

    if (data[c] >= 128)
        sum += data[c];

into conditionally zeroing a register, instead of putting the cmov on the loop-carried dep chain.
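
To spell out the transformation being requested, here is a source-level sketch (the sum_if/sum_zeroed names and the tmp temporary are illustrative only, not gcc's internal representation):

    // What gcc7/8 effectively do: if-convert so the cmov selects between
    // sum and sum+x, putting the 2c cmov on the loop-carried dep chain.
    long long sum_if(const int *data)
    {
        long long sum = 0;
        for (int c = 0; c < 32768; c++)
            if (data[c] >= 128)
                sum += data[c];
        return sum;
    }

    // Equivalent form gcc could rewrite it to: the compare+cmov zeroes an
    // independent per-iteration temporary, so the only loop-carried
    // operation is the 1c add (this is what gcc6 emits for the ternary
    // source above).
    long long sum_zeroed(const int *data)
    {
        long long sum = 0;
        for (int c = 0; c < 32768; c++) {
            int tmp = (data[c] >= 128) ? data[c] : 0;
            sum += tmp;
        }
        return sum;
    }

Both forms compute the same result for all inputs (tmp is an int, sign-extended on the add, exactly like data[c] in the if-version), which is what makes the rewrite safe.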
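
For anyone who wants to reproduce the slowdown, a minimal timing-harness sketch (assumptions: Linux/glibc with C11 for aligned_alloc and clock_gettime; REPS and the rand()%256 input distribution are arbitrary choices; sumarray is the function from this report, compiled in a separate translation unit with each compiler under test):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    long long sumarray(const int *data);   // from the report, separate TU

    int main(void)
    {
        enum { N = 32768, REPS = 100000 };
        int *data = aligned_alloc(64, N * sizeof(int));  // 64B-aligned, as the
        for (int i = 0; i < N; i++)                      // function assumes
            data[i] = rand() % 256;        // ~half the elements are >= 128

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        long long sink = 0;                // keep calls from being optimized out
        for (int r = 0; r < REPS; r++)
            sink += sumarray(data);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("%.3f ns/element (sink=%lld)\n",
               sec * 1e9 / ((double)N * REPS), sink);
        free(data);
        return 0;
    }

At e.g. 4GHz a cycle is 0.25ns, so the 3c-vs-1.75c difference should show up directly in the ns/element figure.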