https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118125
--- Comment #8 from Martin Jambor <jamborm at gcc dot gnu.org> ---
I guess I should have started with looking at annotated assembly. The hot
loop in the hot functions changes from:

      53 :   ,-> 5534e0:  lea          (%r11,%rax,1),%rsi
     659 :   |   5534e4:  mov          (%rsi),%edi
     485 :   |   5534e6:  mov          0x4(%rsi),%r8d
    2173 :   |   5534ea:  vmovsd       (%rcx,%rdi,8),%xmm0
    1032 :   |   5534ef:  mov          0x8(%rsi),%edi
    1673 :   |   5534f2:  mov          0xc(%rsi),%esi
     550 :   |   5534f5:  vmovhpd      (%rcx,%r8,8),%xmm0,%xmm0
   24357 :   |   5534fb:  vmovsd       (%rcx,%rdi,8),%xmm2
     900 :   |   553500:  vmovhpd      (%rcx,%rsi,8),%xmm2,%xmm2
         :   |   s += *val_ptr++ * src(*colnum_ptr++);
    2198 :   |   553505:  vmulpd       0x10(%rdx,%rax,2),%xmm2,%xmm2
   10806 :   |   55350b:  vfmadd132pd  (%rdx,%rax,2),%xmm2,%xmm0
   19463 :   |   553511:  add          $0x10,%rax
     158 :   |   553515:  vaddpd       %xmm0,%xmm1,%xmm1
         :   |   while (val_ptr != val_end_of_row)
   65079 :   |   553519:  cmp          -0x538(%rbp),%rax
     689 :   '-- 553520:  jne          5534e0

to:

       7 :   ,-> 5535a0:  lea          (%rdi,%r10,1),%rdx
         :   |   return val[i];
     408 :   |   5535a4:  mov          %rdx,-0x500(%rbp)
    2231 :   |   5535ab:  mov          (%rdx),%edx
     420 :   |   5535ad:  mov          %rdx,-0x538(%rbp)
      59 :   |   5535b4:  mov          -0x500(%rbp),%rdx
    2214 :   |   5535bb:  mov          0x4(%rdx),%edx
     658 :   |   5535be:  mov          %rdx,-0x540(%rbp)
   21572 :   |   5535c5:  mov          -0x538(%rbp),%rdx
    1916 :   |   5535cc:  vmovsd       (%r9,%rdx,8),%xmm0
     987 :   |   5535d2:  mov          -0x540(%rbp),%rdx
    2341 :   |   5535d9:  vmovhpd      (%r9,%rdx,8),%xmm0,%xmm0
    9349 :   |   5535df:  mov          -0x500(%rbp),%rdx
     117 :   |   5535e6:  mov          0x8(%rdx),%edx
    1162 :   |   5535e9:  mov          %rdx,-0x538(%rbp)
     581 :   |   5535f0:  mov          -0x500(%rbp),%rdx
   18660 :   |   5535f7:  mov          0xc(%rdx),%edx
    1778 :   |   5535fa:  mov          %rdx,-0x500(%rbp)
     271 :   |   553601:  mov          -0x538(%rbp),%rdx
    2605 :   |   553608:  vmovsd       (%r9,%rdx,8),%xmm2
    4943 :   |   55360e:  mov          -0x500(%rbp),%rdx
    1206 :   |   553615:  vmovhpd      (%r9,%rdx,8),%xmm2,%xmm2
         :   |   s += *val_ptr++ * src(*colnum_ptr++);
   11703 :   |   55361b:  vmulpd       0x10(%rax,%r10,2),%xmm2,%xmm2
   56077 :   |   553622:  vfmadd132pd  (%rax,%r10,2),%xmm2,%xmm0
   47327 :   |   553628:  add          $0x10,%r10
     871 :   |   55362c:  vaddpd       %xmm0,%xmm1,%xmm1
         :   |   while (val_ptr != val_end_of_row)
   66067 :   |   553630:  cmp          %r11,%r10
    1762 :   `-- 553633:  jne          5535a0

So it looks like a register allocation/spilling issue: the slow loop
keeps bouncing the column-index pointer and the loaded indices through
stack slots (-0x500, -0x538 and -0x540 off %rbp), while the fast loop
keeps them all in registers. The gimple IL of the loop is the same in
both cases, but the "local count" of the BB with the loop body (in the
optimized dump) is 3540039452134 in the fast version and only
832066009199 in the slow one (so down ~77%).
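
For readers without the benchmark source at hand, the two source lines in the annotation look like the inner loop of a sparse matrix-vector product over one row. Below is a minimal sketch of such a loop, assuming CSR-style storage. The names `CsrMatrix`, `row_start`, `colnums`, `values` and `row_dot` are hypothetical and are not taken from the benchmark, and `src` is a plain `std::vector` indexed with `[]` rather than the `src(...)` accessor in the annotated source. The 32-bit column indices and 8-byte values are guesses based on the addressing modes in the listings (`mov (%rsi),%edi`, scale 8 on the `src` loads).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CSR container, for illustration only.
struct CsrMatrix {
    std::vector<std::size_t>  row_start; // rows + 1 offsets into colnums/values
    std::vector<unsigned int> colnums;   // column index of each stored entry
    std::vector<double>       values;    // value of each stored entry
};

// Dot product of one matrix row with src, written in the same pointer-walking
// style as the two source lines shown in the annotated assembly.
double row_dot(const CsrMatrix &m, std::size_t row,
               const std::vector<double> &src)
{
    assert(row + 1 < m.row_start.size());

    const double       *val_ptr        = m.values.data()  + m.row_start[row];
    const double       *val_end_of_row = m.values.data()  + m.row_start[row + 1];
    const unsigned int *colnum_ptr     = m.colnums.data() + m.row_start[row];

    double s = 0.0;
    while (val_ptr != val_end_of_row)
        s += *val_ptr++ * src[*colnum_ptr++]; // indexed gather from src
    return s;
}
```

Each iteration needs one column index, one gathered `src` element and one matrix value. The vectorized loops above handle four entries per trip, so four indices are live at once along with the two stream pointers, which is where the extra register pressure in the slow version shows up.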