[Bug tree-optimization/60575] inefficient vectorization of compare into bytes on amd64

pinskia at gcc dot gnu.org via Gcc-bugs Sun, 15 Aug 2021 15:52:05 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60575


--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Since GCC 5, we now produce:
.L4:
        movdqu  (%rsi,%rax,2), %xmm0
        movdqu  16(%rsi,%rax,2), %xmm1
        pcmpgtw %xmm4, %xmm0
        pcmpgtw %xmm4, %xmm1
        pand    %xmm3, %xmm0
        pand    %xmm3, %xmm1
        pand    %xmm2, %xmm0
        pand    %xmm2, %xmm1
        packuswb        %xmm1, %xmm0
        movups  %xmm0, (%rdi,%rax)
        addq    $16, %rax
        cmpq    $1024, %rax
        jne     .L4

Note that I removed __builtin_assume_aligned, which is why the loads and
stores are unaligned (movdqu/movups).

Also, there are two extra pands: the second pand on each register (the one
with %xmm2) is not needed.
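
For reference, a loop of roughly this shape would lead to code like the
listing above. This is only a sketch, not the testcase attached to the bug:
the function name, the element types, and the threshold parameter are
assumptions inferred from the assembly (16-bit loads, a signed pcmpgtw, byte
stores, and a trip count of 1024).

```c
#include <stddef.h>
#include <stdint.h>

#define N 1024  /* assumed from the cmpq $1024 loop bound in the listing */

/* Hypothetical reconstruction: store 1 where in[i] is greater than the
   threshold and 0 otherwise, narrowing a 16-bit compare result to a byte. */
void compare_to_bytes(uint8_t *restrict out, const int16_t *restrict in,
                      int16_t threshold)
{
    for (size_t i = 0; i < N; i++)
        out[i] = in[i] > threshold;
}
```

After the first pand every lane already holds 0 or 1, so a further mask
before packuswb cannot change the result, which would explain why the second
pand is redundant.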
