https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31667
--- Comment #6 from Allan Jensen <linux at carewolf dot com> ---
(In reply to Andrew Pinski from comment #5)
> We produce this now:
>
>         movdqa  x(%rip), %xmm1
>         pxor    %xmm0, %xmm0
>         movdqa  %xmm1, %xmm2
>         punpckhbw       %xmm0, %xmm1
>         movaps  %xmm1, y+16(%rip)
>         movdqa  x+16(%rip), %xmm1
>         punpcklbw       %xmm0, %xmm2
>         movaps  %xmm2, y(%rip)
>         movdqa  %xmm1, %xmm2
>         punpckhbw       %xmm0, %xmm1
>         movaps  %xmm1, y+48(%rip)
>         movdqa  x+32(%rip), %xmm1
>         punpcklbw       %xmm0, %xmm2
>         movaps  %xmm2, y+32(%rip)
>         movdqa  %xmm1, %xmm2
>         punpckhbw       %xmm0, %xmm1
>         movaps  %xmm1, y+80(%rip)
>         movdqa  x+48(%rip), %xmm1
>         punpcklbw       %xmm0, %xmm2
>         movaps  %xmm2, y+64(%rip)
>         movdqa  %xmm1, %xmm2
>         punpckhbw       %xmm0, %xmm1
>         punpcklbw       %xmm0, %xmm2
>         movaps  %xmm1, y+112(%rip)
>         movaps  %xmm2, y+96(%rip)
>
> And even ICC produce a similar thing except scheduled differently.

I hope that is because you forgot -msse4.1?
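
For anyone wanting to reproduce this, a minimal sketch of a loop that would give code of this shape. This is an assumption reconstructed from the quoted assembly, not the testcase attached to the bug: the array names x and y come from the symbols in the listing, the element types and the count of 64 are inferred from the offsets (64 bytes read from x, 128 bytes written to y), and the function name widen is made up.

```c
#include <stdint.h>

/* Assumed layout: 64 bytes in, 64 16-bit words out (inferred from the asm). */
uint8_t  x[64];
uint16_t y[64];

/* Hypothetical name. Zero-extends every byte of x into a 16-bit element of y. */
void widen(void)
{
    for (int i = 0; i < 64; i++)
        y[i] = x[i];   /* unsigned char -> unsigned short, zero extension */
}
```

Compiling with something like `gcc -O3 -S widen.c` and then again with `-msse4.1` added should show whether the pxor/punpcklbw/punpckhbw sequence changes. Presumably the point of the question is that SSE4.1 has pmovzxbw, which zero-extends eight bytes to eight words in one instruction without needing a zeroed register.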