https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103550
Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |roger at nextmovesoftware dot com

--- Comment #8 from Roger Sayle <roger at nextmovesoftware dot com> ---
I think this is a reassociation problem (rather than a register allocation
issue).  On x86-like machines, it's better to evaluate "*mem + op" as
"tmp = op; tmp += *mem;", taking advantage of the available addressing modes,
than as "tmp = *mem; tmp2 = op; tmp += tmp2;".  This is demonstrated by the
following simplified (and over-simplified) test case:

inline unsigned long s0(unsigned long x) {
  return (x>>8) | (x<<56);
}

void foo(unsigned long *x) {
  x[0] += s0(x[1]) + x[9];
}

void bar(unsigned long *x) {
  x[0] += s0(x[1]) + s0(x[14]) + x[9];
}

In the over-simplified foo, GCC generates optimal code, performing the
computation (rotation) first, then adding from memory, and finally adding to
memory:

foo:
	movq	8(%rdi), %rax
	rorq	$8, %rax
	addq	72(%rdi), %rax
	addq	%rax, (%rdi)
	ret

which is much better than the (hypothetical) reverse order, which requires
more instructions (the extra moves observed by cqwrteur) and extra registers:

badfoo:
	movq	(%rdi), %rax
	addq	72(%rdi), %rax
	movq	8(%rdi), %rdx
	rorq	$8, %rdx
	addq	%rdx, %rax
	movq	%rax, (%rdi)
	ret

Alas, this is exactly what happens for more complex cases, like bar above,
which is currently compiled by mainline GCC to:

bar:
	movq	8(%rdi), %rdx
	movq	(%rdi), %rax
	addq	72(%rdi), %rax
	rorq	$8, %rdx
	addq	%rdx, %rax
	movq	112(%rdi), %rdx
	rorq	$8, %rdx
	addq	%rdx, %rax
	movq	%rax, (%rdi)
	ret

The issue appears in the tree-ssa optimizers.  The 006t.gimple dump for bar
initially has the additions of memory last:

void bar (long unsigned int * x)
{
  long unsigned int D.1990;

  _1 = x + 8;
  _2 = *_1;
  _3 = s0 (_2);
  _4 = x + 112;
  _5 = *_4;
  _6 = s0 (_5);
  _7 = _3 + _6;
  _8 = x + 72;
  _9 = *_8;
  D.1990 = _7 + _9;
  _10 = *x;
  _11 = D.1990 + _10;
  *x = _11;
}

but by the time the tree-ssa optimizers have finished with it, the reads from
memory have been reassociated to the front, so 250t.optimized contains:

void bar (long unsigned int * x)
{
  long unsigned int _1;
  long unsigned int _2;
  long unsigned int _4;
  long unsigned int _5;
  long unsigned int _6;
  long unsigned int _9;
  long unsigned int _10;
  long unsigned int _13;
  long unsigned int _14;

  <bb 2> [local count: 1073741824]:
  _1 = MEM[(long unsigned int *)x_7(D) + 8B];
  _9 = _1 r>> 8;
  _2 = MEM[(long unsigned int *)x_7(D) + 112B];
  _10 = _2 r>> 8;
  _4 = MEM[(long unsigned int *)x_7(D) + 72B];
  _5 = *x_7(D);
  _13 = _4 + _5;
  _14 = _9 + _13;
  _6 = _10 + _14;
  *x_7(D) = _6;
  return;
}

Notice that the first addition, _4 + _5, mentions two memory references rather
than either of the computations _9 or _10.

I believe a suitable heuristic is that, when reassociating, memory references
should come last, or at least not first.  Hence (mem1 + mem2 + op3 + mem4 +
op5) should reassociate as (op3 + op5 + mem1 + mem2 + mem4).  Ideally, any
destination memory should appear last of all, so mem2 = (mem1 + mem2 + op3 +
mem4 + op5) should be reassociated as mem2 = (op3 + op5 + mem1 + mem4 + mem2).
I suspect the intent of moving memory references "early" is to reduce memory
latency, so perhaps the ideal compromise/hybrid is to reassociate a
computation first, then memory accesses as early as possible, and finally the
destination memory last, i.e. mem2 = (op3 + mem1 + mem4 + op5 + mem2).
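For illustration only, here is a minimal sketch (not a patch; the name
bar_reassoc is made up) of bar rewritten by hand with the operand order the
heuristic above would produce: the computations (rotations) first, the plain
memory read next, and the read-modify-write destination x[0] last.  s0 is the
rotation from the test case above, made static so the snippet is
self-contained.

/* Hand-reassociated variant of bar, showing only the proposed operand
   order, not a workaround.  */
static inline unsigned long s0(unsigned long x) {
  return (x>>8) | (x<<56);
}

void bar_reassoc(unsigned long *x) {
  unsigned long t = s0(x[1]) + s0(x[14]);  /* computations first       */
  t += x[9];                               /* plain memory load next   */
  x[0] += t;                               /* destination memory last  */
}

This order mirrors the optimal foo sequence above (rotate, then add from
memory, then add to memory); whether such a source-level order survives
reassociation is, of course, exactly the issue here.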