[Bug tree-optimization/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2

rguenth at gcc dot gnu.org via Gcc-bugs Thu, 20 Jan 2022 03:12:20 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908


--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
I have no good idea how to tackle such issues.  Possibly an mdreorg pass
could, for the code region "near" the function prologue, scan for loads that
are known to access the arguments in a way that conflicts, with respect to
STLF (store-to-load forwarding), with how GCC itself would pass them, and
split those.

--- c-ray-f.s   2022-01-20 12:00:41.660954367 +0100
+++ c-ray-f.s.fixed     2022-01-20 12:00:38.160908539 +0100
@@ -334,8 +334,12 @@
        .cfi_def_cfa_offset 160
        movupd  (%rdi), %xmm5
        movsd   16(%rdi), %xmm9
-       movupd  184(%rsp), %xmm13
-       movupd  160(%rsp), %xmm15
+       movsd   184(%rsp), %xmm13
+       movhpd  192(%rsp), %xmm13
+#      movupd  184(%rsp), %xmm13
+       movsd   160(%rsp), %xmm15
+       movhpd  168(%rsp), %xmm15
+#      movupd  160(%rsp), %xmm15
        movsd   176(%rsp), %xmm10
        movaps  %xmm5, 16(%rsp)
        unpckhpd        %xmm5, %xmm5

indeed improves performance back to previous levels.  That's the ray_sphere
"prologue"; preceding it is only

ray_sphere:
.LFB33:
        .cfi_startproc
        subq    $152, %rsp
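
For reference, the two load shapes in the diff above can be written with SSE2
intrinsics.  This is only a minimal sketch, not code from c-ray or from GCC:
the function names load_whole and load_split are made up for illustration, and
whether a compiler emits exactly movupd or movsd + movhpd for them depends on
its version and flags.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* One 16-byte unaligned load of p[0..1]; typically a single movupd. */
__m128d load_whole(const double *p)
{
    return _mm_loadu_pd(p);
}

/* The same two doubles fetched with two 8-byte loads; typically
   movsd followed by movhpd, as in the hand-edited assembly. */
__m128d load_split(const double *p)
{
    __m128d lo = _mm_load_sd(p);     /* low lane = p[0], high lane = 0 */
    return _mm_loadh_pd(lo, p + 1);  /* high lane = p[1] */
}
```

Both functions return the same vector; they differ only in how many loads are
issued, which is what matters when the memory was just written by narrower
stores.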


At .stv1/.stv2 we see

(note 4 3 11 2 NOTE_INSN_FUNCTION_BEG)
(insn 11 4 13 2 (set (reg:V2DF 174 [ vect_ray_orig_x_87.270 ])
        (mem/c:V2DF (reg/f:DI 16 argp) [1 MEM <vector(2) double> [(double *)&ray]+0 S16 A64])) 1673 {movv2df_internal}
     (nil))
...
(insn 16 15 18 2 (set (reg:V2DF 178 [ vect_ray_dir_x_90.266 ])
        (mem/c:V2DF (plus:DI (reg/f:DI 16 argp)
                (const_int 24 [0x18])) [1 MEM <vector(2) double> [(double *)&ray + 24B]+0 S16 A64])) 1673 {movv2df_internal}
     (nil))

at the classic mdreorg place it is

(insn:TI 16 30 11 2 (set (reg:V2DF 49 xmm13 [orig:178 vect_ray_dir_x_90.266 ] [178])
        (mem/c:V2DF (plus:DI (reg/f:DI 7 sp)
                (const_int 184 [0xb8])) [1 MEM <vector(2) double> [(double *)&ray + 24B]+0 S16 A64])) 1673 {movv2df_internal}
     (nil))
(insn 11 16 15 2 (set (reg:V2DF 51 xmm15 [orig:174 vect_ray_orig_x_87.270 ] [174])
        (mem/c:V2DF (plus:DI (reg/f:DI 7 sp)
                (const_int 160 [0xa0])) [1 MEM <vector(2) double> [(double *)&ray]+0 S16 A64])) 1673 {movv2df_internal}
     (nil))

both might have enough info to tell that we load from an argument and how
that argument was passed.  But I don't know enough RTL details to say
how difficult it would be to split vector loads from the argument space
if it is "misaligned" compared to the argument passing sequence.
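
As a rough illustration of what "misaligned compared to the argument passing
sequence" could mean, here is a small sketch of the offset test such a pass
might apply.  Everything in it is an assumption: the function name
load_spans_stores is hypothetical, and the model (the caller fills the
argument area with equally sized stores, and a load forwards only if it lies
within a single one of them) is the simplest possible view of STLF, not a
description of any particular microarchitecture or of existing GCC code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: the caller wrote the argument with stores of
   store_size bytes, each starting at a multiple of store_size from the
   start of the argument.  Returns true if a load of load_size bytes at
   byte offset load_off needs data from more than one such store.
   Assumes load_size > 0 and store_size > 0. */
bool load_spans_stores(size_t load_off, size_t load_size, size_t store_size)
{
    size_t first = load_off / store_size;                   /* store holding the first byte */
    size_t last = (load_off + load_size - 1) / store_size;  /* store holding the last byte */
    return first != last;
}
```

Under this model, and assuming the caller used 8-byte stores, the 16-byte
loads at &ray + 0 and &ray + 24 from the dumps above both span two stores,
while the 8-byte movsd at offset 16 does not.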

I do wonder though how CLX is fine with such an access pattern ;)  (did you
test with just -O2?)
