[Bug tree-optimization/57534] [6/7/8 Regression]: Performance regression versus 4.7.3, 4.8.1 is ~15% slower

aldyh at gcc dot gnu.org Wed, 28 Feb 2018 01:55:19 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534


Aldy Hernandez <aldyh at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|i?86-*-*                    |i?86-*-*, x86-64

--- Comment #20 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
For the record, an even smaller test that I believe shows the problem even on
x86-64:

int ind;
int cond(void);

double hand_benchmark_cache_ronly( double *x) {
    double sum=0.0;
    while (cond())
        sum += x[ind] + x[ind+1] + x[ind+2] + x[ind+3];
    return sum;
}

With -O2 we get an extra lea in the loop:

        movslq  ind(%rip), %rdx
        leaq    0(,%rdx,8), %rax        <-- BOO!
        movsd   8(%rbx,%rax), %xmm0
        addsd   (%rbx,%rdx,8), %xmm0
        addsd   16(%rbx,%rax), %xmm0
        addsd   24(%rbx,%rax), %xmm0
        addsd   8(%rsp), %xmm0
        movsd   %xmm0, 8(%rsp)

whereas with -O2 -fno-tree-slsr we get:

        movslq  ind(%rip), %rax
        movsd   8(%rbx,%rax,8), %xmm0
        addsd   (%rbx,%rax,8), %xmm0
        addsd   16(%rbx,%rax,8), %xmm0
        addsd   24(%rbx,%rax,8), %xmm0
        addsd   8(%rsp), %xmm0
        movsd   %xmm0, 8(%rsp)

The .optimized dump for -O2 shows ind*8 being CSE'd into a single temporary
(_3), and each address being calculated as "ind*8 + CST":

  _2 = (long unsigned int) ind.0_1;
  _3 = _2 * 8;          ;; common expression: ind*8
  _4 = x_26(D) + _3;
  _5 = *_4;
  _7 = _3 + 8;          ;; ind*8 + 8
  _8 = x_26(D) + _7;
  _9 = *_8;
...

Whereas with -O2 -fno-tree-slsr, the address is calculated as "(ind+CST) * 8 +
x":

  ind.0_1 = ind;
  _2 = (long unsigned int) ind.0_1;
  _3 = _2 * 8;
  _4 = x_26(D) + _3;
  _5 = *_4;
  _6 = _2 + 1;
  _7 = _6 * 8;          ;; (ind+1) * 8
  _8 = x_26(D) + _7;    ;; (ind+1) * 8 + x
  _9 = *_8;

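Ironically, the -O2 GIMPLE looks more efficient, but gets crappy addressing on
x86: once ind*8 is shared, the offset loads no longer use the scaled-index
addressing form, hence the extra lea.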
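
To make the two address forms concrete, here is a rough C model of what the two
dumps compute. It is only a sketch: the function names (addr_slsr,
addr_no_slsr, forms_agree) are made up for illustration and are not anything
from GCC, and it assumes an 8-byte double, as on x86-64. Both forms should
yield the same address; they differ only in where the multiply by 8 happens.

```c
#include <stddef.h>

/* Hypothetical model of the -O2 (SLSR) form: scale once, then add a
   byte offset, i.e. x + (ind*8 + CST). */
static const double *addr_slsr(const double *x, int ind, size_t k) {
    size_t scaled = (size_t)ind * sizeof(double);       /* _3 = _2 * 8  */
    size_t offset = scaled + k * sizeof(double);        /* _7 = _3 + 8  */
    return (const double *)((const char *)x + offset);  /* _8 = x + _7  */
}

/* Hypothetical model of the -fno-tree-slsr form: add to the index first,
   then scale, i.e. x + (ind+CST) * 8. */
static const double *addr_no_slsr(const double *x, int ind, size_t k) {
    size_t idx = (size_t)ind + k;                       /* _6 = _2 + 1  */
    size_t offset = idx * sizeof(double);               /* _7 = _6 * 8  */
    return (const double *)((const char *)x + offset);  /* _8 = x + _7  */
}

/* Returns 1 if both forms agree for x[ind] .. x[ind+3], the four
   elements read in the loop body; the caller must keep them in bounds. */
int forms_agree(const double *x, int ind) {
    for (size_t k = 0; k < 4; k++)
        if (addr_slsr(x, ind, k) != addr_no_slsr(x, ind, k))
            return 0;
    return 1;
}
```

The second form maps directly onto an x86 base + index*8 + displacement
operand, which matches the -fno-tree-slsr assembly above; the first form has
already folded the scale into the shared temporary.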

