https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98856
--- Comment #18 from Richard Biener <rguenth at gcc dot gnu.org> ---
There's another thing - we end up with

  vmovq   %rax, %xmm3
  vpinsrq $1, %rdx, %xmm3, %xmm0

but that has way worse latency than the alternative you'd get w/o SSE 4.1:

  vmovq      %rax, %xmm3
  vmovq      %rdx, %xmm7
  punpcklqdq %xmm7, %xmm3

For example on Zen3 vmovq and vpinsrq have latencies of 3 while punpck has a
latency of only one, so the second variant should have 2 cycles less latency.

Testcase:

typedef long v2di __attribute__((vector_size(16)));

v2di foo (long a, long b)
{
  return (v2di){a, b};
}

Even on Skylake it's 2 (movq) + 3 (vpinsr), so there it's 5 vs. 3.  Not sure
if we should do this late somehow (peephole or splitter) since it requires
one more %xmm register.
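
For reference, a minimal intrinsics sketch of the two sequences being compared
(not from the PR, just illustrative): _mm_cvtsi64_si128 maps to vmovq,
_mm_insert_epi64 to vpinsrq (SSE4.1), and _mm_unpacklo_epi64 to punpcklqdq.
The point is that the two vmovq in the second variant are independent, so the
critical path is only movq + punpcklqdq.

#include <immintrin.h>

/* pinsr variant: vmovq followed by vpinsrq, serial dependency chain */
__m128i build_pinsr (long long a, long long b)
{
  __m128i v = _mm_cvtsi64_si128 (a);   /* vmovq   %rax, %xmm */
  return _mm_insert_epi64 (v, b, 1);   /* vpinsrq $1, %rdx, %xmm, %xmm */
}

/* punpck variant: two independent vmovq feeding one punpcklqdq */
__m128i build_punpck (long long a, long long b)
{
  __m128i lo = _mm_cvtsi64_si128 (a);  /* vmovq %rax, %xmm */
  __m128i hi = _mm_cvtsi64_si128 (b);  /* vmovq %rdx, %xmm */
  return _mm_unpacklo_epi64 (lo, hi);  /* punpcklqdq %xmm, %xmm */
}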