https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69943
Bug ID: 69943
Summary: expressions with multiple associative operators don't
always create instruction-level parallelism
Product: gcc
Version: 5.3.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Three separate problems (which should perhaps be separate bugs; let me know):
* associativity not exploited for ILP in integer operations
* using a mov from memory instead of an add
* FP ILP from associativity generates two extra mov instructions
gcc 5.3.0 -O3 (http://goo.gl/IRdw05) has two problems compiling this:
int sumi(int a, int b,int c,int d,int e,int f,int g,int h) {
return a+b+c+d+e+f+g+h;
}
addl %edi, %esi
movl 8(%rsp), %eax # when an arg comes from memory, it forgets to use lea as a 3-arg add
addl %esi, %edx
addl %edx, %ecx
addl %ecx, %r8d
addl %r8d, %r9d
addl %r9d, %eax
addl 16(%rsp), %eax
The expression is evaluated mostly in order from left to right, not as
((a+b) + (c+d)) + ((e+f) + (g+h)), whose dependency chain is only 3 adds
deep. The serial chain gives it a latency of 8 clocks. If
the inputs became ready at one-per-clock, this would be ideal (only one add
depends on the last input), but we shouldn't assume that when we can't see the
code that generated them.
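For comparison, here is a minimal sketch of the balanced form written out by hand. The name sumi_tree is hypothetical and not part of the testcase above. It only shows the dependency tree the optimizer could build on its own. For in-range inputs it returns the same value as sumi.

```c
/* Serial form from the report: each add depends on the previous one. */
int sumi(int a, int b, int c, int d, int e, int f, int g, int h) {
    return a + b + c + d + e + f + g + h;
}

/* Hypothetical hand-balanced form (name is an assumption, for illustration).
 * The four inner adds are independent of each other, so an out-of-order
 * core can issue them in parallel; the dependency chain is 3 adds deep. */
int sumi_tree(int a, int b, int c, int d, int e, int f, int g, int h) {
    return ((a + b) + (c + d)) + ((e + f) + (g + h));
}
```

Whether gcc preserves this shape or re-linearizes it is not something this sketch demonstrates; it would have to be checked in the generated asm.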
The same lack of parallelism happens on ARM, ARM64, and PPC.
---------
The FP version of the same *does* take advantage of associativity for
parallelism with -ffast-math, but uses two redundant mov instructions:
float sumf(float a, float b,float c,float d,float e,float f,float g,float h) {
return a+b+c+d+e+f+g+h;
}
addss %xmm4, %xmm5 # e, D.1876
addss %xmm6, %xmm7 # g, D.1876
addss %xmm2, %xmm3 # c, D.1876
addss %xmm0, %xmm1 # a, D.1876
addss %xmm7, %xmm5 # D.1876, D.1876
movaps %xmm5, %xmm2 # D.1876, D.1876
addss %xmm3, %xmm2 # D.1876, D.1876
movaps %xmm2, %xmm0 # D.1876, D.1876
addss %xmm1, %xmm0 # D.1876, D.1876
clang avoids any unnecessary instructions, but has less FP ILP, and the same
lack of integer ILP.
Interestingly, clang lightly auto-vectorizes sumf when the expression is
parenthesised for ILP, but only *without* -ffast-math. http://goo.gl/Pqjtu1.
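The parenthesised FP variant referred to above can be sketched as follows. The name sumf_tree is hypothetical; the exact source is behind the link. FP addition is not associative in general, so without -ffast-math the compiler has to keep whatever grouping the source specifies, and writing the tree by hand is the only way to ask for it.

```c
/* Hypothetical name, for illustration: the float sum with the grouping
 * written out in the source. Without -ffast-math the compiler must honor
 * these parentheses, since reassociating FP adds can change the result. */
float sumf_tree(float a, float b, float c, float d,
                float e, float f, float g, float h) {
    return ((a + b) + (c + d)) + ((e + f) + (g + h));
}
```

For inputs where rounding matters, this can return a slightly different value from the left-to-right sumf, which is why the transformation needs -ffast-math when the compiler does it.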
As usual, IDK whether to mark this as RTL, tree-ssa, or middle-end. The
integer ILP problem is not target specific.