https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80695
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
On x86_64 vectorization is not profitable, likely due to the higher cost of
unaligned vector stores?  But yes, I can see that vectorizing it as

  _24 = VIEW_CONVERT_EXPR<long unsigned int>(_2);
  _25 = VIEW_CONVERT_EXPR<long unsigned int>(prephitmp_21);
  _26 = VIEW_CONVERT_EXPR<long unsigned int>(prephitmp_21);
  _27 = VIEW_CONVERT_EXPR<long unsigned int>(prephitmp_19);
  vect_cst__28 = {_27, _26, _25, _24};
  vectp.6_29 = &f_8(D)->_IO_read_base;
  MEM[(char * *)vectp.6_29] = vect_cst__28;

isn't good, though the cost modeling looks reasonable (vector construction
from scalar cost plus unaligned store cost).

Now on x86_64 we construct the vector via the stack for some reason:

_IO_new_file_overflow:
.LFB0:
        .cfi_startproc
        movq    8(%rdi), %rax
        movq    %rax, -16(%rsp)
        movq    64(%rdi), %rax
        cmpq    %rax, -16(%rsp)
        je      .L2
        movq    16(%rdi), %xmm0
.L3:
        movq    %xmm0, 8(%rdi)
        movhps  -16(%rsp), %xmm0
        movups  %xmm0, 24(%rdi)
        movq    -16(%rsp), %xmm0
        movq    %rax, -16(%rsp)
        movhps  -16(%rsp), %xmm0
        movzbl  %sil, %eax
        movups  %xmm0, 40(%rdi)
        ret
.L2:
        movq    56(%rdi), %rcx
        movq    %rcx, -16(%rsp)
        movq    -16(%rsp), %xmm0
        punpcklqdq      %xmm0, %xmm0
        movups  %xmm0, 8(%rdi)
        movq    -16(%rsp), %xmm0
        jmp     .L3

In the end it's a matter of properly cost-modelling this and not making a mess
out of it during RTL expansion / optimization.
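For reference, a minimal self-contained C sketch of the kind of code involved
(this is not the PR's actual testcase; the struct layout, field names and the
function signature are guesses reconstructed from the offsets and the
movzbl %sil, %eax in the assembly above):

/* Hypothetical reduction: many consecutive pointer-sized struct fields
   assigned from a couple of shared scalars, which is the shape that tempts
   SLP into building vectors from scalars and emitting unaligned vector
   stores.  All names here are made up for illustration.  */

struct file_like
{
  char *flags_pad;   /* offset 0 */
  char *read_ptr;    /* offset 8,  cf. 8(%rdi)  */
  char *read_end;    /* offset 16, cf. 16(%rdi) */
  char *read_base;   /* offset 24, cf. 24(%rdi) */
  char *write_base;  /* offset 32 */
  char *write_ptr;   /* offset 40, cf. 40(%rdi) */
  char *write_end;   /* offset 48 */
  char *buf_base;    /* offset 56, cf. 56(%rdi) */
  char *buf_end;     /* offset 64, cf. 64(%rdi) */
};

int
overflow_like (struct file_like *f, int ch)
{
  /* Reset the read window when the buffer has been consumed...  */
  if (f->read_ptr == f->buf_end)
    f->read_ptr = f->read_end = f->buf_base;

  /* ...then initialize the remaining pointers from a few shared values;
     these adjacent stores are what end up as the movups above.  */
  f->read_base = f->read_ptr;
  f->write_base = f->read_ptr;
  f->write_ptr = f->read_ptr;
  f->write_end = f->buf_end;

  return (unsigned char) ch;
}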