[Bug rtl-optimization/69943] New: expressions with multiple associative operators don't always create instruction-level parallelism

peter at cordes dot ca Wed, 24 Feb 2016 07:17:47 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69943


            Bug ID: 69943
           Summary: expressions with multiple associative operators don't
                    always create instruction-level parallelism
           Product: gcc
           Version: 5.3.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

Separate problems (which maybe should be separate bugs, let me know):

* associativity not exploited for ILP in integer operations
* using a mov from memory instead of an add
* FP ILP from associativity generates two extra mov instructions


gcc 5.3.0 -O3 (http://goo.gl/IRdw05) has two problems compiling this:

int sumi(int a, int b,int c,int d,int e,int f,int g,int h) {
  return a+b+c+d+e+f+g+h;
}
        addl    %edi, %esi
        movl    8(%rsp), %eax     # when an arg comes from memory, it forgets
                                  # to use lea as a 3-arg add
        addl    %esi, %edx
        addl    %edx, %ecx
        addl    %ecx, %r8d
        addl    %r8d, %r9d
        addl    %r9d, %eax
        addl    16(%rsp), %eax

The expression is evaluated mostly in order from left to right, not as
((a+b) + (c+d)) + ((e+f) + (g+h)).  This gives a latency of 8 clocks.  If
the inputs became ready at one-per-clock, this would be ideal (only one add
depends on the last input), but we shouldn't assume that when we can't see the
code that generated them.
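
For reference, the balanced form can be written out by hand.  This is only a
sketch: sumi_tree is a hypothetical name, not part of the testcase above, and
whether a given compiler keeps this shape in its output is not guaranteed.  For
inputs that don't overflow, it returns the same value as sumi.

```c
/* Hypothetical helper (not from the original testcase): the same sum with
   explicit parentheses, so the source itself describes a balanced tree. */
int sumi_tree(int a, int b, int c, int d, int e, int f, int g, int h) {
  int ab = a + b;  /* these four adds do not depend on each other */
  int cd = c + d;
  int ef = e + f;
  int gh = g + h;
  /* two more independent adds, then the final one: 3 adds deep */
  return (ab + cd) + (ef + gh);
}
```

With enough execution ports, the four first-level adds can issue in the same
cycle, so the result is ready after three dependent adds.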

The same lack of parallelism happens on ARM, ARM64, and PPC.

---------

The FP version of the same *does* take advantage of associativity for
parallelism with -ffast-math, but uses two redundant mov instructions:

float sumf(float a, float b,float c,float d,float e,float f,float g,float h) {
  return a+b+c+d+e+f+g+h;
}
        addss   %xmm4, %xmm5  # e, D.1876
        addss   %xmm6, %xmm7  # g, D.1876
        addss   %xmm2, %xmm3  # c, D.1876
        addss   %xmm0, %xmm1  # a, D.1876
        addss   %xmm7, %xmm5  # D.1876, D.1876
        movaps  %xmm5, %xmm2        # D.1876, D.1876
        addss   %xmm3, %xmm2  # D.1876, D.1876
        movaps  %xmm2, %xmm0        # D.1876, D.1876
        addss   %xmm1, %xmm0  # D.1876, D.1876

clang avoids any unnecessary instructions, but has less FP ILP, and the same
lack of integer ILP.

Interestingly, clang lightly auto-vectorizes sumf when the expression is
parenthesised for ILP, but only *without* -ffast-math.  http://goo.gl/Pqjtu1.
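
A sketch of what "parenthesised for ILP" means here, assuming the same
balanced grouping as in the integer case (sumf_tree is a hypothetical name; the
exact source behind the link may differ).  Without -ffast-math the compiler has
to respect this grouping, and the result can differ in rounding from the
left-to-right sum.

```c
/* Hypothetical helper: sumf with the grouping written explicitly.
   Under strict FP semantics the parentheses fix the evaluation tree. */
float sumf_tree(float a, float b, float c, float d,
                float e, float f, float g, float h) {
  return ((a + b) + (c + d)) + ((e + f) + (g + h));
}
```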

As usual, IDK whether to mark this as RTL, tree-ssa, or middle-end.  The
integer ILP problem is not target specific.
