https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103550
--- Comment #8 from Roger Sayle <roger at nextmovesoftware dot com> ---
I think this is a reassociation problem (rather than a register allocation
issue).
On x86-like machines, it's better to evaluate "*mem + op" as
"tmp = op; tmp += *mem;", taking advantage of the available addressing modes,
than as "tmp = *mem; tmp2 = op; tmp += tmp2;".
This is demonstrated by the following simplified (and over-simplified) test
case:
inline unsigned long s0(unsigned long x) {
  return (x >> 8) | (x << 56);
}

void foo(unsigned long *x) {
  x[0] += s0(x[1]) + x[9];
}

void bar(unsigned long *x) {
  x[0] += s0(x[1]) + s0(x[14]) + x[9];
}
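(For reference, s0 is a rotate-right by 8 bits, which GCC recognizes and emits
as a single rorq. A standalone sanity check of that identity, hypothetical and
not part of the original test case:

  #include <assert.h>

  /* Same body as s0 above; static so it links at any optimization
     level, and assuming a 64-bit unsigned long as on x86-64.  */
  static inline unsigned long rotr8 (unsigned long x)
  {
    return (x >> 8) | (x << 56);
  }

  int main (void)
  {
    /* The low byte wraps around to become the top byte.  */
    assert (rotr8 (0x0123456789abcdefUL) == 0xef0123456789abcdUL);
    return 0;
  }
)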
In the over-simplified foo, GCC generates optimal code, performing the
computation (rotation) first, then adding from memory, and finally adding to
memory.
foo:    movq    8(%rdi), %rax
        rorq    $8, %rax
        addq    72(%rdi), %rax
        addq    %rax, (%rdi)
        ret
This is much better than the (hypothetical) reverse order, which requires
more instructions (the extra moves observed by cqwrteur) and an extra
register:
badfoo: movq    (%rdi), %rax
        addq    72(%rdi), %rax
        movq    8(%rdi), %rdx
        rorq    $8, %rdx
        addq    %rdx, %rax
        movq    %rax, (%rdi)
        ret
Alas, this is exactly what happens in more complex cases, such as bar above,
which is currently compiled by mainline GCC to:
bar:    movq    8(%rdi), %rdx
        movq    (%rdi), %rax
        addq    72(%rdi), %rax
        rorq    $8, %rdx
        addq    %rdx, %rax
        movq    112(%rdi), %rdx
        rorq    $8, %rdx
        addq    %rdx, %rax
        movq    %rax, (%rdi)
        ret
The issue appears in the tree-ssa optimizers. The 006t.gimple dump for bar
initially has the additions of the memory operands last:
void bar (long unsigned int * x)
{
  long unsigned int D.1990;

  _1 = x + 8;
  _2 = *_1;
  _3 = s0 (_2);
  _4 = x + 112;
  _5 = *_4;
  _6 = s0 (_5);
  _7 = _3 + _6;
  _8 = x + 72;
  _9 = *_8;
  D.1990 = _7 + _9;
  _10 = *x;
  _11 = D.1990 + _10;
  *x = _11;
}
but by the time the tree-ssa optimizers have finished with it, the reads from
memory have been reassociated to the front, so the 250t.optimized dump
contains:
void bar (long unsigned int * x)
{
  long unsigned int _1;
  long unsigned int _2;
  long unsigned int _4;
  long unsigned int _5;
  long unsigned int _6;
  long unsigned int _9;
  long unsigned int _10;
  long unsigned int _13;
  long unsigned int _14;

  <bb 2> [local count: 1073741824]:
  _1 = MEM[(long unsigned int *)x_7(D) + 8B];
  _9 = _1 r>> 8;
  _2 = MEM[(long unsigned int *)x_7(D) + 112B];
  _10 = _2 r>> 8;
  _4 = MEM[(long unsigned int *)x_7(D) + 72B];
  _5 = *x_7(D);
  _13 = _4 + _5;
  _14 = _9 + _13;
  _6 = _10 + _14;
  *x_7(D) = _6;
  return;
}
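Rendered back into C for illustration (my paraphrase of the dump above, not
compiler output), the optimized bar is effectively:

  void bar_reassociated (unsigned long *x)
  {
    unsigned long t = x[9] + x[0];  /* _13 = _4 + _5: two memory refs */
    t += s0 (x[1]);                 /* _14 = _9 + _13 */
    t += s0 (x[14]);                /* _6 = _10 + _14 */
    x[0] = t;
  }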
Notice that the first addition, _13 = _4 + _5, combines two memory references
rather than either of the computations _9 or _10. I believe a suitable
heuristic is that, when reassociating, memory references should come last, or
at least not first.
Hence (mem1 + mem2 + op3 + mem4 + op5) should reassociate as (op3 + op5 + mem1
+ mem2 + mem4). And ideally, any destination memory should appear last, so
mem2 = (mem1 + mem2 + op3 + mem4 + op5) should be reassociated as mem2 = (op3 +
op5 + mem1 + mem4 + mem2).
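One way to picture that heuristic (a minimal sketch in plain C, not GCC's
actual rank machinery in tree-ssa-reassoc): give each operand of the
flattened chain a kind, and sort so that computations precede memory
references and the destination's own memory sorts last.

  #include <stdlib.h>

  enum operand_kind
  {
    OP_COMPUTATION = 0,   /* result of arithmetic, e.g. _9 or _10 above */
    OP_MEMORY      = 1,   /* a plain load, e.g. _4 */
    OP_DEST_MEMORY = 2    /* a load of the store's destination, e.g. _5 */
  };

  struct operand
  {
    enum operand_kind kind;
    /* ... the SSA name or expression itself would live here ... */
  };

  /* qsort comparator: lower kinds sort earlier, so computations come
     first, plain memory references after them, and the destination's
     own memory last.  */
  static int
  operand_order (const void *pa, const void *pb)
  {
    const struct operand *a = pa;
    const struct operand *b = pb;
    return (int) a->kind - (int) b->kind;
  }

  /* Sorting the flattened operand list of the addition chain before
     rebuilding it:  qsort (ops, n_ops, sizeof *ops, operand_order);  */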
I suspect the intent of moving memory references "early" is to reduce memory
latency, so perhaps the ideal compromise/hybrid is to reassociate a
computation first, then the memory accesses as early as possible, and finally
the destination memory last, i.e. mem2 = (op3 + mem1 + mem4 + op5 + mem2).
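Applied to bar, that hybrid order corresponds to the following
hand-reassociated source (hypothetical; the hope being that each load then
folds into an addq memory operand, as in foo):

  void bar_hybrid (unsigned long *x)
  {
    unsigned long t = s0 (x[1]);  /* a computation first */
    t += x[9];                    /* source memory as early as possible */
    t += s0 (x[14]);              /* the remaining computation */
    x[0] += t;                    /* destination memory last */
  }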