https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69943
            Bug ID: 69943
           Summary: expressions with multiple associative operators don't
                    always create instruction-level parallelism
           Product: gcc
           Version: 5.3.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

Separate problems (which maybe should be separate bugs, let me know):

* associativity not exploited for ILP in integer operations
* using a mov from memory instead of an add
* FP ILP from associativity generates two extra mov instructions

gcc 5.3.0 -O3 (http://goo.gl/IRdw05) has two problems compiling this:

int sumi(int a, int b,int c,int d,int e,int f,int g,int h) {
  return a+b+c+d+e+f+g+h;
}

        addl    %edi, %esi
        movl    8(%rsp), %eax   # when an arg comes from memory, it forgets to use lea as a 3-arg add
        addl    %esi, %edx
        addl    %edx, %ecx
        addl    %ecx, %r8d
        addl    %r8d, %r9d
        addl    %r9d, %eax
        addl    16(%rsp), %eax

The expression is evaluated mostly in order from left to right, not as
((a+b) + (c+d)) + ((e+f) + (g+h)). This gives it a latency of 8 clocks. If the
inputs became ready at one per clock, this would be ideal (only one add depends
on the last input), but we shouldn't assume that when we can't see the code
that generated them.

The same lack of parallelism happens on ARM, ARM64, and PPC.

---------

The FP version of the same *does* take advantage of associativity for
parallelism with -ffast-math, but uses two redundant mov instructions:

float sumf(float a, float b,float c,float d,float e,float f,float g,float h) {
  return a+b+c+d+e+f+g+h;
}

        addss   %xmm4, %xmm5    # e, D.1876
        addss   %xmm6, %xmm7    # g, D.1876
        addss   %xmm2, %xmm3    # c, D.1876
        addss   %xmm0, %xmm1    # a, D.1876
        addss   %xmm7, %xmm5    # D.1876, D.1876
        movaps  %xmm5, %xmm2    # D.1876, D.1876
        addss   %xmm3, %xmm2    # D.1876, D.1876
        movaps  %xmm2, %xmm0    # D.1876, D.1876
        addss   %xmm1, %xmm0    # D.1876, D.1876

clang avoids any unnecessary instructions, but has less FP ILP, and the same
lack of integer ILP.
Interestingly, clang lightly auto-vectorizes sumf when the expression is
parenthesised for ILP, but only *without* -ffast-math: http://goo.gl/Pqjtu1

As usual, IDK whether to mark this as RTL, tree-ssa, or middle-end. The integer
ILP problem is not target-specific.
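
For reference, here is a minimal sketch of what "parenthesised for ILP" means at the source level, next to the two functions from the testcase. The names sumi_tree and sumf_tree are made up for this illustration and are not part of the original testcase. The tree shape is the ((a+b) + (c+d)) + ((e+f) + (g+h)) grouping mentioned above: the four inner adds are independent of each other, so the longest dependency chain is 3 adds instead of 7.

```c
/* Left-to-right versions, as in the report. */
int sumi(int a, int b, int c, int d, int e, int f, int g, int h) {
  return a+b+c+d+e+f+g+h;            /* one chain of 7 dependent adds */
}

float sumf(float a, float b, float c, float d,
           float e, float f, float g, float h) {
  return a+b+c+d+e+f+g+h;            /* same chain, in FP */
}

/* Hypothetical hand-parenthesised versions (names are illustrative only).
 * (a+b), (c+d), (e+f), (g+h) do not depend on each other, so they can
 * execute in parallel; the critical path is 3 adds deep. */
int sumi_tree(int a, int b, int c, int d, int e, int f, int g, int h) {
  /* Caveat: at the C source level, regrouping can change which intermediate
   * sums overflow, and signed overflow is undefined behaviour. */
  return ((a+b) + (c+d)) + ((e+f) + (g+h));
}

float sumf_tree(float a, float b, float c, float d,
                float e, float f, float g, float h) {
  /* Caveat: FP addition is not associative, so this can round differently
   * from sumf. That is why the compiler only does this regrouping itself
   * under -ffast-math. */
  return ((a+b) + (c+d)) + ((e+f) + (g+h));
}
```

Comparing the -O3 -S output for sumi against sumi_tree, and sumf against sumf_tree (with and without -ffast-math), should show whether the compiler keeps the hand-written tree shape or needs it spelled out.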