https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80695

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
On x86_64 the vectorization is not considered profitable, likely due to the
higher cost of unaligned vector stores.  But yes, I can see that vectorizing
it as

  _24 = VIEW_CONVERT_EXPR<long unsigned int>(_2);
  _25 = VIEW_CONVERT_EXPR<long unsigned int>(prephitmp_21);
  _26 = VIEW_CONVERT_EXPR<long unsigned int>(prephitmp_21);
  _27 = VIEW_CONVERT_EXPR<long unsigned int>(prephitmp_19);
  vect_cst__28 = {_27, _26, _25, _24};
  vectp.6_29 = &f_8(D)->_IO_read_base;
  MEM[(char * *)vectp.6_29] = vect_cst__28;

isn't good, even though the cost modeling looks reasonable (the cost of
constructing the vector from scalars plus the cost of the unaligned store).
Now on x86_64 we construct the vector via the stack for some reason:

_IO_new_file_overflow:
.LFB0:
        .cfi_startproc
        movq    8(%rdi), %rax
        movq    %rax, -16(%rsp)
        movq    64(%rdi), %rax
        cmpq    %rax, -16(%rsp)
        je      .L2
        movq    16(%rdi), %xmm0
.L3:
        movq    %xmm0, 8(%rdi)
        movhps  -16(%rsp), %xmm0
        movups  %xmm0, 24(%rdi)
        movq    -16(%rsp), %xmm0
        movq    %rax, -16(%rsp)
        movhps  -16(%rsp), %xmm0
        movzbl  %sil, %eax
        movups  %xmm0, 40(%rdi)
        ret
.L2:
        movq    56(%rdi), %rcx
        movq    %rcx, -16(%rsp)
        movq    -16(%rsp), %xmm0
        punpcklqdq      %xmm0, %xmm0
        movups  %xmm0, 8(%rdi)
        movq    -16(%rsp), %xmm0
        jmp     .L3
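
For comparison, the direct construction one would hope for corresponds
roughly to the intrinsics sketch below (illustrative only, assuming plain
SSE2; the helper name is made up and this is not a proposed fix, just what
avoiding the stack round-trip looks like):

  #include <emmintrin.h>
  #include <stdint.h>

  /* Build a vector of two pointer-sized values directly from integer
     registers (typically movq + punpcklqdq) and store it unaligned,
     instead of spilling each value to -16(%rsp) and reloading it with
     movq/movhps as above.  */
  static inline void
  store_pointer_pair (char **dst, char *lo, char *hi)
  {
    __m128i v = _mm_set_epi64x ((int64_t) (intptr_t) hi,   /* high half */
                                (int64_t) (intptr_t) lo);  /* low half  */
    _mm_storeu_si128 ((__m128i *) dst, v);                 /* movups    */
  }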

In the end it's a matter of cost-modelling this properly and of not making a
mess of it during RTL expansion / optimization.
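
For reference, the kind of source pattern that produces such an SLP store
group looks roughly like the reduced sketch below (struct layout and field
names are assumed for illustration, modelled on the FILE members visible in
the GIMPLE above; this is not the actual _IO_new_file_overflow source):

  /* Several adjacent char * members written back to back from a couple
     of scalar values: cheap as scalar stores, but a store group of this
     shape is what the SLP vectorizer can turn into vector constructions
     plus unaligned vector stores.  */
  struct file_like
  {
    char *read_ptr;
    char *read_end;
    char *read_base;
    char *write_base;
    char *write_ptr;
    char *write_end;
    char *buf_base;
    char *buf_end;
  };

  void
  overflow_like (struct file_like *f)
  {
    char *p = f->read_end;
    char *q = f->buf_base;

    f->read_base = p;
    f->write_base = q;
    f->write_ptr = q;
    f->write_end = q;
  }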
