https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104582
--- Comment #17 from Richard Biener <rguenth at gcc dot gnu.org> ---
For

FAIL: gcc.target/i386/pr91446.c scan-assembler-times vmovdqa[^\\n\\r]*xmm[0-9] 2

we used to produce

0000000000000000 <foo>:
   0:   48 83 ec 28             sub    $0x28,%rsp
   4:   c4 e1 f9 6e d7          vmovq  %rdi,%xmm2
   9:   c4 e1 f9 6e da          vmovq  %rdx,%xmm3
   e:   c4 e3 e9 22 ce 01       vpinsrq $0x1,%rsi,%xmm2,%xmm1
  14:   c4 e3 e1 22 c1 01       vpinsrq $0x1,%rcx,%xmm3,%xmm0
  1a:   48 89 e7                mov    %rsp,%rdi
  1d:   c5 f9 7f 0c 24          vmovdqa %xmm1,(%rsp)
  22:   c5 f9 7f 44 24 10       vmovdqa %xmm0,0x10(%rsp)
  28:   e8 00 00 00 00          call   2d <foo+0x2d>
  2d:   48 83 c4 28             add    $0x28,%rsp
  31:   c3                      ret

but now reject this on costing grounds.  The scalar code is

0000000000000000 <foo>:
   0:   48 83 ec 28             sub    $0x28,%rsp
   4:   48 89 3c 24             mov    %rdi,(%rsp)
   8:   48 89 e7                mov    %rsp,%rdi
   b:   48 89 74 24 08          mov    %rsi,0x8(%rsp)
  10:   48 89 54 24 10          mov    %rdx,0x10(%rsp)
  15:   48 89 4c 24 18          mov    %rcx,0x18(%rsp)
  1a:   e8 00 00 00 00          call   1f <foo+0x1f>
  1f:   48 83 c4 28             add    $0x28,%rsp
  23:   c3                      ret

I think the scalar variant is 5 uops up to the call while the vector variant
is 9 uops.  The scalar variant can also execute 4 of the uops in parallel
(well, I guess only up to 3 with 3 store ports).

I think the scalar variant is better and so I'm inclined to adjust the
testcase.
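For readers without the testsuite at hand, here is a minimal sketch of the kind of source that would give both listings above: four 64-bit integer arguments (arriving in %rdi, %rsi, %rdx, %rcx) are stored into a 32-byte stack aggregate whose address is then passed to a callee. This is a reconstruction from the disassembly, not a copy of pr91446.c. The type name `info`, its field names, the callee name `bar` and the `last_seen` global are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 32-byte aggregate: four 8-byte fields, one per store to
   (%rsp), 0x8(%rsp), 0x10(%rsp) and 0x18(%rsp) in the scalar listing.  */
typedef struct
{
  unsigned long long width, height;
  long long x, y;
} info;

/* Illustration only: remembers the last aggregate handed to bar.  */
static info last_seen;

/* Stand-in for the callee.  It is defined here only so the sketch is
   self-contained.  */
void
bar (info *t)
{
  assert (t != NULL);
  last_seen = *t;
}

/* Four adjacent 64-bit stores followed by a call.  SLP vectorization can
   pair them into two 128-bit stores (the vmovq/vpinsrq/vmovdqa sequence);
   otherwise they stay as four scalar mov instructions.  */
void
foo (unsigned long long width, unsigned long long height,
     long long x, long long y)
{
  info t;
  t.width = width;
  t.height = height;
  t.x = x;
  t.y = y;
  bar (&t);
}
```

To compare against the listings, the callee would presumably need to be an external declaration rather than a definition in the same file. With the body visible, as in this sketch, the compiler may inline `bar` and the stack stores can look different.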