https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81496
--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Maybe even better would be to emit
        vmovq   %r1, %xmm0
        vpinsrq $1, %r2, %xmm0
        vpinsrq $2, %r3, %ymm0
        vpinsrq $3, %r4, %ymm0
but not sure how to achieve that.

For another testcase:

typedef long long W __attribute__((vector_size (32)));

W
f1 (long long x, long long y, long long z, long long w)
{
  return (W) { x, y, z, w };
}

W
f2 (long long x, long long y, long long z, long long w)
{
  return (W) { w, z, y, x };
}

we emit with -O3 -mavx2 -mtune=intel:
        vmovq   %rsi, %xmm2
        vmovq   %rcx, %xmm3
        vpinsrq $1, %rdi, %xmm2, %xmm1
        vpinsrq $1, %rdx, %xmm3, %xmm0
        vinserti128     $0x1, %xmm1, %ymm0, %ymm0
and here again, I wonder if vmovq + 3x vpinsrq wouldn't be better.  In that
case, handling this in i386.c ix86_expand_vector_init or helpers thereof
would be possible.  Guess it should be benchmarked on various CPUs.
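For anyone experimenting with alternative expansions, a lane-order reference may be useful. The following is only a minimal sketch: it checks that a lane-by-lane build yields the same vector as the initializers in the testcase, and it says nothing about which instructions the compiler selects. The helpers build_lanewise, same_lanes and self_check are hypothetical names introduced here for illustration; they are not part of GCC or of the bug report.

```c
#include <assert.h>
#include <string.h>

typedef long long W __attribute__((vector_size (32)));

/* Testcase functions from the comment above. */
W
f1 (long long x, long long y, long long z, long long w)
{
  return (W) { x, y, z, w };
}

W
f2 (long long x, long long y, long long z, long long w)
{
  return (W) { w, z, y, x };
}

/* Hypothetical helper: fill lane 0 first, then insert lanes 1..3 one at a
   time.  This mirrors the "one move + three inserts" idea only at the
   source level; the generated code may differ.  */
W
build_lanewise (long long a, long long b, long long c, long long d)
{
  W v = { a, 0, 0, 0 };
  v[1] = b;
  v[2] = c;
  v[3] = d;
  return v;
}

/* Hypothetical helper: returns 1 when both vectors hold identical lanes.  */
int
same_lanes (W p, W q)
{
  return memcmp (&p, &q, sizeof p) == 0;
}

/* Hypothetical helper: f1 keeps argument order, f2 reverses it.  */
void
self_check (void)
{
  assert (same_lanes (f1 (1, 2, 3, 4), build_lanewise (1, 2, 3, 4)));
  assert (same_lanes (f2 (1, 2, 3, 4), build_lanewise (4, 3, 2, 1)));
}
```

Compiling this with the flags mentioned above (-O3 -mavx2 -mtune=intel) and adding -S would show what each variant currently expands to on a given GCC version.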