https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102178
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|tree-optimization           |rtl-optimization
                 CC|                            |vmakarov at gcc dot gnu.org
           Keywords|                            |ra

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
I see a lot more GPR <-> XMM moves in the 'after' case:

  1035 :   401c8b:       vaddsd %xmm1,%xmm0,%xmm0
  1953 :   401c8f:       vmovq  %rcx,%xmm1
   305 :   401c94:       vaddsd %xmm8,%xmm1,%xmm1
  3076 :   401c99:       vmovq  %xmm0,%r14
   590 :   401c9e:       vmovq  %r11,%xmm0
   267 :   401ca3:       vmovq  %xmm1,%r8
   136 :   401ca8:       vmovq  %rdx,%xmm1
   448 :   401cad:       vaddsd %xmm1,%xmm0,%xmm1
  1703 :   401cb1:       vmovq  %xmm1,%r9     (*)
   834 :   401cb6:       vmovq  %r8,%xmm1
  1719 :   401cbb:       vmovq  %r9,%xmm0     (*)
  2782 :   401cc0:       vaddsd %xmm0,%xmm1,%xmm1
 22135 :   401cc4:       vmovsd %xmm1,%xmm1,%xmm0
  1261 :   401cc8:       vmovq  %r14,%xmm1
   646 :   401ccd:       vaddsd %xmm0,%xmm1,%xmm0
 18136 :   401cd1:       vaddsd %xmm2,%xmm5,%xmm1
   629 :   401cd5:       vmovq  %xmm1,%r8
   142 :   401cda:       vaddsd %xmm6,%xmm3,%xmm1
   177 :   401cde:       vmovq  %xmm0,%r14
   288 :   401ce3:       vmovq  %xmm1,%r9
   177 :   401ce8:       vmovq  %r8,%xmm1
   174 :   401ced:       vmovq  %r9,%xmm0

Those look like RA / spilling artifacts; IIRC I saw Hongtao posting patches
in this area (to regcprop, I think?).  The above is definitely bad, for
example the insns marked (*) seem to swap %xmm0 and %xmm1 via %r9.

The function is LBM_performStreamCollide.  The sinking pass does nothing
wrong: it moves the unconditionally executed

-  _948 = _861 + _867;
-  _957 = _944 + _948;
-  _912 = _861 + _873;
...
-  _981 = _853 + _865;
-  _989 = _977 + _981;
-  _916 = _853 + _857;
-  _924 = _912 + _916;

into a conditionally executed block.  But that increases register pressure
by 5 FP regs (if I counted correctly) in that area, so this would be the
usual issue of GIMPLE transforms not being register-pressure aware (a small
C sketch of the effect is at the end of this comment).
-fschedule-insns -fsched-pressure seems to be able to mitigate this somewhat
(though I think EBB scheduling cannot undo such movement).

In postreload I see transforms like

-(insn 466 410 411 7 (set (reg:DF 0 ax [530])
-        (mem/u/c:DF (symbol_ref/u:DI ("*.LC10") [flags 0x2]) [0 S8 A64])) "lbm.c":241:5 141 {*movdf_internal}
-     (expr_list:REG_EQUAL (const_double:DF 9.939744999999999830464503247640095651149749755859375e-1 [0x0.fe751ce28ed5fp+0])
-        (nil)))
-(insn 411 466 467 7 (set (reg:DF 25 xmm5 [orig:123 prephitmp_643 ] [123])
+(insn 411 410 467 7 (set (reg:DF 25 xmm5 [orig:123 prephitmp_643 ] [123])
         (reg:DF 0 ax [530])) "lbm.c":241:5 141 {*movdf_internal}
      (nil))

which makes it seem like we could have reloaded %xmm5 directly from .LC10
(see the sketch below).  But the spilling to GPRs seems to be present
already after LRA, and cprop_hardreg doesn't do anything bad either.

The differences can be seen on trunk with -Ofast -march=znver2
[-fdisable-tree-sink2].

We have X86_TUNE_INTER_UNIT_MOVES_TO_VEC / X86_TUNE_INTER_UNIT_MOVES_FROM_VEC,
and the interesting thing is that when I disable them I do see some spilling
to the stack but also quite a few re-materialized constants (loads from .LC*,
as seen in the opportunity above).  It might be interesting to benchmark with

  -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec

and to find a way to arrange costs so that IRA/LRA prefer re-materializing
constants from the constant pool over spilling to GPRs (if that's possible
at all - Vlad?).
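To make the register-pressure effect of the sinking above concrete, here is
a minimal C sketch (hypothetical names, not the actual lbm.c code):

/* Before sinking: the adds execute unconditionally, so only the two
   sums t1/t2 are live across the branch.  */
double before_sinking (double a, double b, double c, double d, int cond)
{
  double t1 = a + b;
  double t2 = c + d;
  if (cond)
    return t1 + t2;
  return 0.0;
}

/* After sinking: the adds only execute when needed, but now all four
   operands a/b/c/d have to stay live into the guarded block instead of
   just t1/t2.  In LBM_performStreamCollide this pattern repeats often
   enough to raise FP pressure by ~5 regs, which the allocator then
   resolves with the GPR<->XMM moves shown above.  */
double after_sinking (double a, double b, double c, double d, int cond)
{
  if (cond)
    {
      double t1 = a + b;
      double t2 = c + d;
      return t1 + t2;
    }
  return 0.0;
}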
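And for reference, the reload we seem to miss in the postreload snippet,
sketched in AT&T syntax (illustrative instruction choice, not the exact
emitted code):

        # what we effectively emit now: bounce the constant through a GPR
        movq    .LC10(%rip), %rax
        vmovq   %rax, %xmm5

        # what reloading %xmm5 from the constant pool would look like
        vmovsd  .LC10(%rip), %xmm5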
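The experiments above boil down to the following invocations (assuming
lbm.c is compiled standalone; -S only added here to inspect the assembly):

  # baseline vs. second sinking pass disabled
  gcc -Ofast -march=znver2 -S lbm.c
  gcc -Ofast -march=znver2 -fdisable-tree-sink2 -S lbm.c

  # pressure-aware scheduling mitigation mentioned above
  gcc -Ofast -march=znver2 -fschedule-insns -fsched-pressure -S lbm.c

  # proposed benchmark: forbid direct GPR<->XMM moves so the allocator
  # spills to the stack / rematerializes from the constant pool instead
  gcc -Ofast -march=znver2 \
      -mtune-ctrl=^inter_unit_moves_from_vec,^inter_unit_moves_to_vec -S lbm.c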