https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96166
--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
That is what happens on the trunk (the revision that introduced didn't do that
yet).
But even that permutation is more expensive than the rotate,
        rolq    $32, (%rdi)
vs.
        movq    (%rdi), %xmm1
        pshufd  $225, %xmm1, %xmm0
        movq    %xmm0, (%rdi)
At least for code size...
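
For reference, a minimal sketch in C of why the two sequences above compute the same value. It is not taken from the bug's testcase: the helper names (rotate_by_32, pshufd_model, shuffle_low_qword) are made up for illustration, and the shuffle is only a scalar model of how pshufd selects lanes from its immediate. The immediate 225 is 0xE1, i.e. the lane selectors 1, 0, 2, 3, so the two low 32-bit lanes are swapped, which is the same as rotating the low 64 bits by 32.

```c
#include <stdint.h>

/* Rotate a 64-bit value left by 32 bits, i.e. swap its 32-bit halves.
   This is what the single rolq $32 does in place on memory. */
static uint64_t rotate_by_32(uint64_t v)
{
    return (v << 32) | (v >> 32);
}

/* Scalar model of pshufd lane selection: destination lane i takes the
   source lane named by bits [2i+1:2i] of the 8-bit immediate. */
static void pshufd_model(const uint32_t src[4], uint32_t dst[4], unsigned imm8)
{
    for (int i = 0; i < 4; i++)
        dst[i] = src[(imm8 >> (2 * i)) & 3u];
}

/* Hypothetical helper: model the movq / pshufd $225 / movq sequence.
   The upper two lanes are zero, as after a movq load into an xmm register,
   and only the low 64 bits of the result are stored back. */
static uint64_t shuffle_low_qword(uint64_t v)
{
    uint32_t src[4] = { (uint32_t)v, (uint32_t)(v >> 32), 0, 0 };
    uint32_t dst[4];

    pshufd_model(src, dst, 225u);   /* 225 == 0xE1: lanes 1, 0, 2, 3 */
    return ((uint64_t)dst[1] << 32) | dst[0];
}
```

Both helpers map 0x1122334455667788 to 0x5566778811223344, so the difference between the two sequences is cost and size, not the result.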