https://gcc.gnu.org/bugzilla/show_bug.cgi?id=60575
--- Comment #1 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Since GCC 5, we now produce:

.L4:
        movdqu  (%rsi,%rax,2), %xmm0
        movdqu  16(%rsi,%rax,2), %xmm1
        pcmpgtw %xmm4, %xmm0
        pcmpgtw %xmm4, %xmm1
        pand    %xmm3, %xmm0
        pand    %xmm3, %xmm1
        pand    %xmm2, %xmm0
        pand    %xmm2, %xmm1
        packuswb        %xmm1, %xmm0
        movups  %xmm0, (%rdi,%rax)
        addq    $16, %rax
        cmpq    $1024, %rax
        jne     .L4

Note that I removed __builtin_assume_aligned.

Also, I note there are two extra pand's. The second pand (on each of %xmm0
and %xmm1) is not needed.
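
To illustrate why a second pand can be redundant, here is a minimal scalar
sketch of a single 16-bit lane. The mask values are assumptions made only for
illustration: the listing does not show what %xmm3 and %xmm2 hold. I assume
%xmm3 holds 1 in every lane (turning the all-ones compare result into 0/1) and
%xmm2 holds 0x00ff in every lane (clearing the high byte before packuswb). The
function names are hypothetical. Under those assumptions, the two ANDs fold
into one, because AND is associative and 1 & 0x00ff == 1.

```c
#include <stdint.h>

/* ASSUMED mask values, for illustration only; the real contents of
   %xmm3 and %xmm2 are not shown in the comment. */
#define BOOL_MASK 0x0001u /* assumed %xmm3 lane: all-ones -> 1 */
#define BYTE_MASK 0x00ffu /* assumed %xmm2 lane: keep the low byte */

/* One lane as in the listing: signed compare, then two ANDs. */
static uint16_t lane_two_ands(int16_t x, int16_t threshold)
{
    uint16_t v = (x > threshold) ? 0xffffu : 0x0000u; /* pcmpgtw */
    v &= BOOL_MASK;                                   /* first pand */
    v &= BYTE_MASK;                                   /* second pand */
    return v;
}

/* The same lane with the two masks folded into a single AND. */
static uint16_t lane_one_and(int16_t x, int16_t threshold)
{
    uint16_t v = (x > threshold) ? 0xffffu : 0x0000u; /* pcmpgtw */
    v &= (BOOL_MASK & BYTE_MASK);                     /* single pand */
    return v;
}

/* Returns 1 if both variants agree for every 16-bit input value. */
int lanes_agree(int16_t threshold)
{
    for (int32_t x = INT16_MIN; x <= INT16_MAX; x++) {
        if (lane_two_ands((int16_t)x, threshold) !=
            lane_one_and((int16_t)x, threshold))
            return 0;
    }
    return 1;
}
```

Calling lanes_agree() with any threshold checks all 65536 lane values. This
only models the arithmetic of one lane; it says nothing about which pass in
GCC should do the folding.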