https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908
--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
No good idea how to tackle such issues.  Possibly an mdreorg pass could, for
the code region "near" the function prologue, scan for loads that are known
to access the arguments in a way that conflicts, with respect to STLF, with
how GCC itself would pass them, and split those.

--- c-ray-f.s   2022-01-20 12:00:41.660954367 +0100
+++ c-ray-f.s.fixed     2022-01-20 12:00:38.160908539 +0100
@@ -334,8 +334,12 @@
        .cfi_def_cfa_offset 160
        movupd  (%rdi), %xmm5
        movsd   16(%rdi), %xmm9
-       movupd  184(%rsp), %xmm13
-       movupd  160(%rsp), %xmm15
+       movsd   184(%rsp), %xmm13
+       movhpd  192(%rsp), %xmm13
+#      movupd  184(%rsp), %xmm13
+       movsd   160(%rsp), %xmm15
+       movhpd  168(%rsp), %xmm15
+#      movupd  160(%rsp), %xmm15
        movsd   176(%rsp), %xmm10
        movaps  %xmm5, 16(%rsp)
        unpckhpd        %xmm5, %xmm5

indeed improves performance back to previous levels.  That's the ray_sphere
"prologue"; the only thing preceding it is

ray_sphere:
.LFB33:
        .cfi_startproc
        subq    $152, %rsp

At .stv1/.stv2 we see

(note 4 3 11 2 NOTE_INSN_FUNCTION_BEG)
(insn 11 4 13 2 (set (reg:V2DF 174 [ vect_ray_orig_x_87.270 ])
        (mem/c:V2DF (reg/f:DI 16 argp) [1 MEM <vector(2) double> [(double *)&ray]+0 S16 A64])) 1673 {movv2df_internal}
     (nil))
...
(insn 16 15 18 2 (set (reg:V2DF 178 [ vect_ray_dir_x_90.266 ])
        (mem/c:V2DF (plus:DI (reg/f:DI 16 argp)
                (const_int 24 [0x18])) [1 MEM <vector(2) double> [(double *)&ray + 24B]+0 S16 A64])) 1673 {movv2df_internal}
     (nil))

At the classic mdreorg place it is

(insn:TI 16 30 11 2 (set (reg:V2DF 49 xmm13 [orig:178 vect_ray_dir_x_90.266 ] [178])
        (mem/c:V2DF (plus:DI (reg/f:DI 7 sp)
                (const_int 184 [0xb8])) [1 MEM <vector(2) double> [(double *)&ray + 24B]+0 S16 A64])) 1673 {movv2df_internal}
     (nil))
(insn 11 16 15 2 (set (reg:V2DF 51 xmm15 [orig:174 vect_ray_orig_x_87.270 ] [174])
        (mem/c:V2DF (plus:DI (reg/f:DI 7 sp)
                (const_int 160 [0xa0])) [1 MEM <vector(2) double> [(double *)&ray]+0 S16 A64])) 1673 {movv2df_internal}
     (nil))

Both might have enough info to tell that we load from an argument and how
that argument was passed.  But I don't know enough RTL details to say how
difficult it would be to split vector loads from the argument space when they
are "misaligned" compared to the argument passing sequence.

I do wonder though how CLX is fine with such an access pattern ;)  (did you
test with just -O2?)
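
For illustration only, here is a rough C sketch of the kind of check such a pass would need. It is not GCC code. The struct access type and the covered_by_one_store helper are hypothetical names, and the model is deliberately simplified: it treats a load as forwardable only when a single earlier store covers all of its bytes, and it assumes the caller wrote each double of the 48-byte ray argument with its own 8-byte store. Real STLF rules are more involved and differ between microarchitectures. Offsets are relative to the start of the argument, so offset 24 corresponds to 184(%rsp) above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one memory access into the argument area:
   byte offset from the start of the argument, and size in bytes.  */
struct access
{
  size_t off;
  size_t size;
};

/* Simplified model (an assumption, not a hardware specification): a load
   can be satisfied by store-to-load forwarding only if one single store
   covers every byte the load reads.  A load that straddles two stores
   returns false and would be a candidate for splitting.  */
static bool
covered_by_one_store (struct access load, const struct access *stores,
                      size_t n)
{
  assert (n == 0 || stores != NULL);
  for (size_t i = 0; i < n; i++)
    if (stores[i].off <= load.off
        && load.off + load.size <= stores[i].off + stores[i].size)
      return true;
  return false;
}
```

With six 8-byte stores at offsets 0, 8, ..., 40, the 16-byte load at offset 24 (the movupd 184(%rsp) case) is not covered by any single store, while the two 8-byte loads at offsets 24 and 32 (the movsd/movhpd pair) each are. In this simplified model that is the difference between the original and the hand-fixed assembly.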