https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103550

Roger Sayle <roger at nextmovesoftware dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |roger at nextmovesoftware dot com

--- Comment #8 from Roger Sayle <roger at nextmovesoftware dot com> ---
I think this is a reassociation problem (rather than a register allocation
issue).
On x86-like machines, it's better to evaluate "*mem + op" as "tmp = op;
tmp += *mem", taking advantage of the available addressing modes, than as
"tmp = *mem; tmp2 = op; tmp += tmp2".

This is demonstrated by the following simplified (and, in foo's case,
over-simplified) test case:
inline unsigned long s0(unsigned long x) {
  return (x>>8) | (x<<56);  /* rotate right by 8 */
}

void foo(unsigned long *x) {
  x[0] += s0(x[1]) + x[9];
}

void bar(unsigned long *x) {
  x[0] += s0(x[1]) + s0(x[14]) + x[9];
}

In the over-simplified foo, GCC generates optimal code, performing the
computation (rotation) first, then adding from memory, and finally adding to
memory.

foo:    movq    8(%rdi), %rax
        rorq    $8, %rax
        addq    72(%rdi), %rax
        addq    %rax, (%rdi)
        ret

which is much better than the (hypothetical) reverse order below, requiring
more instructions (the extra moves observed by cqwrteur) and extra registers:

badfoo: movq    (%rdi), %rax
        addq    72(%rdi), %rax
        movq    8(%rdi), %rdx
        rorq    $8, %rdx
        addq    %rdx, %rax
        movq    %rax, (%rdi)
        ret

Alas, this is exactly what happens in more complex cases, like bar above,
which is currently compiled by mainline GCC to:

bar:    movq    8(%rdi), %rdx
        movq    (%rdi), %rax
        addq    72(%rdi), %rax
        rorq    $8, %rdx
        addq    %rdx, %rax
        movq    112(%rdi), %rdx
        rorq    $8, %rdx
        addq    %rdx, %rax
        movq    %rax, (%rdi)
        ret

The issue appears in the tree-ssa optimizers.  The 006t.gimple dump for bar
initially has the additions of the memory operands last:

void bar (long unsigned int * x)
{
  long unsigned int D.1990;

  _1 = x + 8;
  _2 = *_1;
  _3 = s0 (_2);
  _4 = x + 112;
  _5 = *_4;
  _6 = s0 (_5);
  _7 = _3 + _6;
  _8 = x + 72;
  _9 = *_8;
  D.1990 = _7 + _9;
  _10 = *x;
  _11 = D.1990 + _10;
  *x = _11;
}

but by the time the tree-ssa optimizers have finished with it, the reads from
memory have been reassociated to the front, so 250t.optimized contains:

void bar (long unsigned int * x)
{
  long unsigned int _1;
  long unsigned int _2;
  long unsigned int _4;
  long unsigned int _5;
  long unsigned int _6;
  long unsigned int _9;
  long unsigned int _10;
  long unsigned int _13;
  long unsigned int _14;

  <bb 2> [local count: 1073741824]:
  _1 = MEM[(long unsigned int *)x_7(D) + 8B];
  _9 = _1 r>> 8;
  _2 = MEM[(long unsigned int *)x_7(D) + 112B];
  _10 = _2 r>> 8;
  _4 = MEM[(long unsigned int *)x_7(D) + 72B];
  _5 = *x_7(D);
  _13 = _4 + _5;
  _14 = _9 + _13;
  _6 = _10 + _14;
  *x_7(D) = _6;
  return;
}

Notice that the first addition, _4 + _5, mentions two memory references rather
than the computations _9 or _10.  I believe a suitable heuristic is that, when
reassociating, memory references should come last, or at least not first.
Hence (mem1 + mem2 + op3 + mem4 + op5) should reassociate as (op3 + op5 + mem1
+ mem2 + mem4).  And ideally, any destination memory should appear last, so
mem2 = (mem1 + mem2 + op3 + mem4 + op5) should be reassociated as mem2 = (op3 +
op5 + mem1 + mem4 + mem2).
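
For illustration only, here is a minimal, self-contained sketch of that
ordering rule expressed as a sort key (a toy model, not GCC's actual
tree-ssa-reassoc code; opnd, cmp_opnd and the kind values are made-up names):
computed operands rank before memory loads, and the operand that is also the
store destination ranks last.

/* Toy model of the suggested operand ordering: computations first,
   memory loads next, and the operand that is also the store
   destination last.  Not GCC internals; just the ordering rule.  */
#include <stdio.h>
#include <stdlib.h>

enum opnd_kind { OPND_COMPUTED, OPND_MEM_LOAD, OPND_DEST_MEM };

struct opnd {
  enum opnd_kind kind;
  int id;                          /* stand-in for the SSA operand */
};

static int cmp_opnd(const void *pa, const void *pb) {
  const struct opnd *a = pa, *b = pb;
  if (a->kind != b->kind)          /* computations < loads < destination */
    return (int)a->kind - (int)b->kind;
  return a->id - b->id;            /* deterministic tie-break for the demo */
}

int main(void) {
  /* mem2 = (mem1 + mem2 + op3 + mem4 + op5), mem2 also the destination. */
  struct opnd ops[] = {
    { OPND_MEM_LOAD, 1 },          /* mem1 */
    { OPND_DEST_MEM, 2 },          /* mem2 */
    { OPND_COMPUTED, 3 },          /* op3  */
    { OPND_MEM_LOAD, 4 },          /* mem4 */
    { OPND_COMPUTED, 5 },          /* op5  */
  };
  qsort(ops, 5, sizeof ops[0], cmp_opnd);
  for (int i = 0; i < 5; i++)      /* prints 3 5 1 4 2, i.e.           */
    printf("%d ", ops[i].id);      /* (op3 + op5 + mem1 + mem4 + mem2) */
  putchar('\n');
  return 0;
}

The point is only the ordering; where exactly it would hook into the
reassociation pass is a separate question.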

I suspect the intent of moving memory references "early" is to reduce memory
latency, so perhaps the ideal compromise/hybrid is to reassociate a
computation first, then the memory accesses as early as possible, and finally
the destination memory last, i.e. mem2 = (op3 + mem1 + mem4 + op5 + mem2).
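
To make that concrete at the source level, here is a hypothetical
hand-reassociated version of bar in the hybrid order (bar_hybrid is a made-up
name, and it reuses s0 from the test case above; the real transformation would
of course be done on GIMPLE, not in the source):

/* Hypothetical hand-reassociated version of bar() in the hybrid order:
   a computation first, the plain memory operand early (so it can fold
   into an add with a memory operand), and the destination x[0] last.  */
void bar_hybrid(unsigned long *x) {
  unsigned long t = s0(x[1]);      /* computation first        */
  t += x[9];                       /* memory operand early     */
  t += s0(x[14]);                  /* second computation       */
  x[0] += t;                       /* destination memory last  */
}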
