https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96562
--- Comment #3 from Hongtao.liu <crazylht at gmail dot com> ---
A simple C testcase:

typedef struct
{
  unsigned char* p;
  unsigned int a;
}st;

st
foo (unsigned char* p, unsigned char* q)
{
  return {p, (unsigned int)(q-p)};
}

There are two issues here.

1. GCC uses memory to move from xmm to gpr.
---
        vmovdqa XMMWORD PTR [rsp-24], xmm0
        mov     rax, QWORD PTR [rsp-24]
        mov     rdx, QWORD PTR [rsp-16]
---

2. GCC uses vpinsrd to initialize st.a, which is suboptimal after reload.

(insn 9 24 23 2 (set (reg:V4SI 20 xmm0 [89])
        (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 4 si [88]))
            (reg:V4SI 21 xmm1 [94])
            (const_int 4 [0x4]))) "../test.c":9:42 4387 {sse4_1_pinsrd}
     (nil))
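For anyone trying to reproduce this: the braced return value in the testcase is C++ syntax, so a plain C compiler will reject it as written. Below is a minimal sketch of what is assumed to be an equivalent C form, using a compound literal. The helper foo_selfcheck is hypothetical and only checks the returned values; it is not part of the report and says nothing about the generated code. The compiler options used in the report are not stated in this comment.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
  unsigned char *p;
  unsigned int a;
} st;

/* Same computation as the testcase. The compound literal (st){...} is an
   assumed C spelling of the braced return value in the report. */
st foo(unsigned char *p, unsigned char *q)
{
  return (st){ p, (unsigned int)(q - p) };
}

/* Hypothetical helper: verifies the returned fields only, not the codegen. */
void foo_selfcheck(void)
{
  unsigned char buf[16];
  st r = foo(buf, buf + 5);
  assert(r.p == buf);
  assert(r.a == 5u);
}
```

The 16-byte struct is returned in a register pair on x86-64, which is where the xmm-to-gpr move in issue 1 would be expected to show up in the assembly output.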