https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84719
--- Comment #10 from Marc Glisse <glisse at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #9)
> So with 2 bytes we get

Try 3 bytes (the worst case).

> Are you sure performance isn't dominated by the
> first init loop (both GCC and clang vectorize it).

Replacing memcpy(,,block) with memcpy(,,8) (the next line masks the other bytes
anyway) gained a factor 8 in running time, when I tried the other day.
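
For readers without the testcase at hand, the substitution described above can be sketched roughly as follows. This is only an illustration, not the code from the PR: the names load_variable, load_fixed and low_bytes_mask are hypothetical, and it assumes a little-endian target, 1 <= block <= 8, and a source buffer with at least 8 readable bytes.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: keeps the low `block` bytes of a 64-bit value.
   Assumes 1 <= block <= 8. */
static uint64_t low_bytes_mask(size_t block)
{
    return block >= 8 ? UINT64_MAX : ((UINT64_C(1) << (8 * block)) - 1);
}

/* Variable-size copy: the compiler cannot fold the length, so this may
   become a call to memcpy or a size-dependent branch sequence. */
static uint64_t load_variable(const unsigned char *src, size_t block)
{
    uint64_t v = 0;
    memcpy(&v, src, block);
    return v & low_bytes_mask(block);
}

/* Fixed-size copy: always reads 8 bytes (typically a single load) and
   relies on the mask to discard the bytes beyond `block`.
   Assumes src[0..7] is readable. */
static uint64_t load_fixed(const unsigned char *src, size_t block)
{
    uint64_t v;
    memcpy(&v, src, 8);
    return v & low_bytes_mask(block);
}
```

On a little-endian machine both functions return the same value for any block from 1 to 8, which is why the mask makes the wider copy safe as long as the extra bytes are readable.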