https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57534

Aldy Hernandez <aldyh at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|i?86-*-*                    |i?86-*-*, x86-64

--- Comment #20 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
For the record, an even smaller test that I believe shows the problem even on
x86-64:

int ind;
int cond(void);

double hand_benchmark_cache_ronly( double *x) {
    double sum=0.0;
    while (cond())
        sum += x[ind] + x[ind+1] + x[ind+2] + x[ind+3];
    return sum;
}

with -O2 we get an extra lea in the loop:

    movslq  ind(%rip), %rdx
    leaq    0(,%rdx,8), %rax        <-- BOO!
    movsd   8(%rbx,%rax), %xmm0
    addsd   (%rbx,%rdx,8), %xmm0
    addsd   16(%rbx,%rax), %xmm0
    addsd   24(%rbx,%rax), %xmm0
    addsd   8(%rsp), %xmm0
    movsd   %xmm0, 8(%rsp)

whereas with -O2 -fno-tree-slsr we get:

    movslq  ind(%rip), %rax
    movsd   8(%rbx,%rax,8), %xmm0
    addsd   (%rbx,%rax,8), %xmm0
    addsd   16(%rbx,%rax,8), %xmm0
    addsd   24(%rbx,%rax,8), %xmm0
    addsd   8(%rsp), %xmm0
    movsd   %xmm0, 8(%rsp)

The .optimized dump for -O2 shows ind*8 being CSE'd away, and the address being
calculated as "ind*8 + CST":

  _2 = (long unsigned int) ind.0_1;
  _3 = _2 * 8;                  ;; common expression: ind*8
  _4 = x_26(D) + _3;
  _5 = *_4;
  _7 = _3 + 8;                  ;; ind*8 + 8
  _8 = x_26(D) + _7;
  _9 = *_8;
  ...

Whereas with -O2 -fno-tree-slsr, the address is calculated as
"(ind+CST) * 8 + x":

  ind.0_1 = ind;
  _2 = (long unsigned int) ind.0_1;
  _3 = _2 * 8;
  _4 = x_26(D) + _3;
  _5 = *_4;
  _6 = _2 + 1;
  _7 = _6 * 8;                  ;; (ind+1) * 8
  _8 = x_26(D) + _7;            ;; (ind+1) * 8 + x
  _9 = *_8;

Ironically the -O2 gimple looks more efficient, but gets crappy addressing on
x86.
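
To make the two address shapes in the dumps concrete, here is a minimal sketch in plain C (not GCC code). The helper names addr_shared_scaled, addr_rescaled and forms_agree are hypothetical and exist only for illustration, and the sketch assumes sizeof(double) == 8 as on x86-64. It only shows that both forms yield the same address; which form maps better onto the base + index*scale + displacement addressing mode is what the assembly above illustrates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 8-byte doubles, matching the "* 8" in the dumps. */
_Static_assert(sizeof(double) == 8, "sketch assumes 8-byte double");

/* Shape of the -O2 dump: ind*8 is computed once and shared,
   then a constant byte offset (k*8) is added to it. */
static uintptr_t addr_shared_scaled(uintptr_t x, size_t ind, size_t k)
{
    size_t scaled = ind * 8;            /* _3 = _2 * 8   */
    return x + (scaled + k * 8);        /* x + (_3 + CST) */
}

/* Shape of the -fno-tree-slsr dump: the constant is added to the
   index first, and the sum is scaled for each access. */
static uintptr_t addr_rescaled(uintptr_t x, size_t ind, size_t k)
{
    size_t idx = ind + k;               /* _6 = _2 + CST */
    return x + idx * 8;                 /* x + _6 * 8    */
}

/* Returns 1 if both shapes agree with &x[ind + k] for k = 0..3. */
static int forms_agree(const double *x, size_t ind)
{
    assert(x != NULL);
    uintptr_t base = (uintptr_t)x;
    for (size_t k = 0; k < 4; k++) {
        uintptr_t expected = (uintptr_t)&x[ind + k];
        if (addr_shared_scaled(base, ind, k) != expected)
            return 0;
        if (addr_rescaled(base, ind, k) != expected)
            return 0;
    }
    return 1;
}
```

Calling forms_agree(x, ind) on any array with at least ind + 4 elements should return 1, since ind*8 + k*8 and (ind+k)*8 are the same value. The difference discussed in this comment is purely in code generation, not in the computed addresses.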