http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60539
--- Comment #2 from chrbr at gcc dot gnu.org ---
Note also that, instead of merging the remaining bytes (3 at most) after the word-at-a-time loop into the byte-at-a-time loop, I had a version that unrolled those 3 bytes. With the merge we have 2 different paths (words, bytes) instead of 3 (words, words + remaining bytes, bytes). The additional control-flow (CF) complexity of the unrolled version was not beneficial on average for my set of benchmarks, compared to the simple version where the remaining bytes fall through to the byte-at-a-time loop.
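
To illustrate the "simple version" described above, here is a minimal C sketch. It is not the code from the patch: the comment does not say which operation is being expanded, so a fixed-length equality check is assumed purely for illustration, and the name sketch_equal and the 4-byte word size are hypothetical. The point is only the control flow: a word loop, then whatever is left (at most 3 bytes) falls through into the same byte loop that also handles short inputs.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not GCC code: returns 1 if the first n bytes of
   a and b are equal, 0 otherwise.  Assumes a 4-byte word. */
int sketch_equal(const unsigned char *a, const unsigned char *b, size_t n)
{
    size_t i = 0;

    /* Word-at-a-time loop: compare 4 bytes per iteration.
       memcpy is used so the loads are valid for any alignment. */
    while (n - i >= sizeof(uint32_t)) {
        uint32_t wa, wb;
        memcpy(&wa, a + i, sizeof wa);
        memcpy(&wb, b + i, sizeof wb);
        if (wa != wb)
            return 0;
        i += sizeof(uint32_t);
    }

    /* Byte-at-a-time loop: the remaining bytes (at most 3 after the
       word loop) fall through to here instead of being unrolled into
       a separate third path. */
    while (i < n) {
        if (a[i] != b[i])
            return 0;
        i++;
    }
    return 1;
}
```

The unrolled alternative would replace the second loop, on the path coming from the word loop, with up to three explicit byte comparisons, at the cost of an extra path through the function.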