https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96933
--- Comment #4 from Segher Boessenkool <segher at gcc dot gnu.org> ---
Yes, timing suggests there is some SHL/LHS flush. On p9 and later we can use mtvsrdd instead of mtvsrd (moving two bytes into place at once), which reduces the number of moves from 16 to 8, and the number of merges from 15 to 7 (and reduces the path length by 1). This sounds like a no-brainer win with that :-)
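
To spell out the arithmetic above, here is a minimal sketch in C. It assumes the 16 byte elements are combined with a balanced binary tree of pairwise merges; that model, and the helper names moves_needed, merges_needed and tree_depth, are illustrative assumptions and not anything in GCC itself.

```c
#include <assert.h>

/* Hypothetical helpers that only model the instruction counts;
   they do not emit or describe real GCC code generation. */

/* GPR-to-VSR moves needed when each move places `per_move` elements. */
static int moves_needed(int elems, int per_move)
{
    assert(elems > 0 && per_move > 0);
    return (elems + per_move - 1) / per_move;
}

/* Pairwise merges needed to combine `parts` partial vectors into one. */
static int merges_needed(int parts)
{
    assert(parts > 0);
    return parts - 1;
}

/* Depth of a balanced binary merge tree over `parts` leaves
   (assumes `parts` is a power of two). */
static int tree_depth(int parts)
{
    assert(parts > 0 && (parts & (parts - 1)) == 0);
    int depth = 0;
    while (parts > 1) {
        parts /= 2;
        depth++;
    }
    return depth;
}
```

Under this model, one element per move (mtvsrd) gives 16 moves, 15 merges and a merge depth of 4, while two elements per move (mtvsrdd) gives 8 moves, 7 merges and a merge depth of 3, which matches the counts quoted above.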