[Bug target/118125] [15 Regression] 7-16% slowdown of 510.parest_r on x86-64(-v3) since r15-6110-g92e0e0f8177530

jamborm at gcc dot gnu.org via Gcc-bugs Wed, 29 Jan 2025 06:10:48 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118125


--- Comment #8 from Martin Jambor <jamborm at gcc dot gnu.org> ---
I guess I should have started with looking at annotated assembly. The
hot loop in the hot functions changes from:

    53 : ,-> 5534e0: lea    (%r11,%rax,1),%rsi
   659 : |   5534e4: mov    (%rsi),%edi
   485 : |   5534e6: mov    0x4(%rsi),%r8d
  2173 : |   5534ea: vmovsd (%rcx,%rdi,8),%xmm0
  1032 : |   5534ef: mov    0x8(%rsi),%edi
  1673 : |   5534f2: mov    0xc(%rsi),%esi
   550 : |   5534f5: vmovhpd (%rcx,%r8,8),%xmm0,%xmm0
 24357 : |   5534fb: vmovsd (%rcx,%rdi,8),%xmm2
   900 : |   553500: vmovhpd (%rcx,%rsi,8),%xmm2,%xmm2
       : |        s += *val_ptr++ * src(*colnum_ptr++);
  2198 : |   553505: vmulpd 0x10(%rdx,%rax,2),%xmm2,%xmm2
 10806 : |   55350b: vfmadd132pd (%rdx,%rax,2),%xmm2,%xmm0
 19463 : |   553511: add    $0x10,%rax
   158 : |   553515: vaddpd %xmm0,%xmm1,%xmm1
       : |        while (val_ptr != val_end_of_row)
 65079 : |   553519: cmp    -0x538(%rbp),%rax
   689 : '-- 553520: jne    5534e0 

to:

     7 : ,-> 5535a0: lea    (%rdi,%r10,1),%rdx
       : |        return val[i];
   408 : |   5535a4: mov    %rdx,-0x500(%rbp)
  2231 : |   5535ab: mov    (%rdx),%edx
   420 : |   5535ad: mov    %rdx,-0x538(%rbp)
    59 : |   5535b4: mov    -0x500(%rbp),%rdx
  2214 : |   5535bb: mov    0x4(%rdx),%edx
   658 : |   5535be: mov    %rdx,-0x540(%rbp)
 21572 : |   5535c5: mov    -0x538(%rbp),%rdx
  1916 : |   5535cc: vmovsd (%r9,%rdx,8),%xmm0
   987 : |   5535d2: mov    -0x540(%rbp),%rdx
  2341 : |   5535d9: vmovhpd (%r9,%rdx,8),%xmm0,%xmm0
  9349 : |   5535df: mov    -0x500(%rbp),%rdx
   117 : |   5535e6: mov    0x8(%rdx),%edx
  1162 : |   5535e9: mov    %rdx,-0x538(%rbp)
   581 : |   5535f0: mov    -0x500(%rbp),%rdx
 18660 : |   5535f7: mov    0xc(%rdx),%edx
  1778 : |   5535fa: mov    %rdx,-0x500(%rbp)
   271 : |   553601: mov    -0x538(%rbp),%rdx
  2605 : |   553608: vmovsd (%r9,%rdx,8),%xmm2
  4943 : |   55360e: mov    -0x500(%rbp),%rdx
  1206 : |   553615: vmovhpd (%r9,%rdx,8),%xmm2,%xmm2
       : |        s += *val_ptr++ * src(*colnum_ptr++);
 11703 : |   55361b: vmulpd 0x10(%rax,%r10,2),%xmm2,%xmm2
 56077 : |   553622: vfmadd132pd (%rax,%r10,2),%xmm2,%xmm0
 47327 : |   553628: add    $0x10,%r10
   871 : |   55362c: vaddpd %xmm0,%xmm1,%xmm1
       : |        while (val_ptr != val_end_of_row)
 66067 : |   553630: cmp    %r11,%r10
  1762 : `-- 553633: jne    5535a0 

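So it looks like a register allocation/spilling issue: the slow loop
bounces the column indices through stack slots (-0x500, -0x538 and
-0x540(%rbp)) via %rdx, whereas the fast loop keeps them in registers.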
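
For reference, the scalar form of the annotated loop can be sketched
roughly as below. Only the while condition and the accumulation
statement come from the source lines in the perf output; the function
name row_dot, its signature, and the use of std::vector with src[...]
in place of src(...) are assumptions made so that the snippet stands
alone. The element types (double values, 32-bit column indices) are
inferred from the 8-byte scaled loads and the %edi/%esi index loads.

```cpp
#include <vector>

// Hypothetical stand-alone version of the per-row kernel. The two
// statements inside the function body mirror the source lines shown
// in the annotation; everything else is an illustrative assumption.
double row_dot(const double *val_ptr, const double *val_end_of_row,
               const unsigned int *colnum_ptr,
               const std::vector<double> &src)
{
  double s = 0.0;
  // One multiply-add per stored entry of the row: the value is read
  // sequentially, the source vector is gathered through the column index.
  while (val_ptr != val_end_of_row)
    s += *val_ptr++ * src[*colnum_ptr++];
  return s;
}
```

The gather through the column index is what produces the
vmovsd/vmovhpd pairs above, so each vectorized iteration needs several
index temporaries live at once.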

The gimple IL of the loop is the same in both cases, but the "local
count" of the BB with the loop body (in the optimized dump) is
3540039452134 in the fast version and only 832066009199 in the slow
one (so down ~76.5%).
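
The size of that drop can be double-checked with a few lines of code.
This is only an arithmetic sketch; relative_drop is a hypothetical
helper and not anything from GCC.

```cpp
// Relative drop from 'before' to 'after', as a fraction of 'before'.
double relative_drop(double before, double after)
{
  return (before - after) / before;
}
```

Calling relative_drop(3540039452134.0, 832066009199.0) gives about
0.765, i.e. the slow version's loop body count is a bit under a
quarter of the fast one's.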
