https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60575
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Since GCC 5, we now produce:
.L4:
movdqu (%rsi,%rax,2), %xmm0
movdqu 16(%rsi,%rax,2), %xmm1
pcmpgtw %xmm4, %xmm0
pcmpgtw %xmm4, %xmm1
pand %xmm3, %xmm0
pand %xmm3, %xmm1
pand %xmm2, %xmm0
pand %xmm2, %xmm1
packuswb %xmm1, %xmm0
movups %xmm0, (%rdi,%rax)
addq $16, %rax
cmpq $1024, %rax
jne .L4
Note that I removed __builtin_assume_aligned from the testcase.
Also note that there are two extra pands, one per vector register: the second pand on each register (the one with %xmm2) is not needed.
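
For illustration only, here is a minimal C sketch of the kind of loop that could produce code of this shape, together with a scalar model of one 16-bit lane showing why the second pand would be redundant. This is not the testcase attached to the bug: the function names, the threshold parameter, and the assumed contents of %xmm3 (1 in every lane) and %xmm2 (0x00ff in every lane) are all guesses made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define N 1024 /* matches the cmpq $1024 loop bound in the assembly */

/* Hypothetical reconstruction of the loop: store 1 where in[i] > threshold,
   else 0, narrowing 16-bit inputs to 8-bit outputs. */
void cmp_to_bytes(uint8_t *restrict out, const int16_t *restrict in,
                  int16_t threshold)
{
    for (size_t i = 0; i < N; i++)
        out[i] = in[i] > threshold;
}

/* Scalar model of one 16-bit lane with both pands applied.
   ASSUMPTION: %xmm3 holds 1 and %xmm2 holds 0x00ff in every lane. */
uint16_t lane_with_both_pands(int16_t x, int16_t threshold)
{
    uint16_t v = (x > threshold) ? 0xFFFFu : 0u; /* pcmpgtw: all ones or zero */
    v &= 1u;      /* first pand (assumed constant 1): mask becomes 0 or 1 */
    v &= 0x00FFu; /* second pand (assumed 0x00ff): no effect on 0 or 1 */
    return v;
}

/* The same lane with the second pand dropped. */
uint16_t lane_with_one_pand(int16_t x, int16_t threshold)
{
    uint16_t v = (x > threshold) ? 0xFFFFu : 0u;
    v &= 1u;
    return v;
}
```

Under those assumptions each lane is already 0 or 1 after the first pand, so masking it to the low byte before packuswb changes nothing, and the two functions agree for every input.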