https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017
--- Comment #13 from Alexander Peslyak <solar-gcc at openwall dot com> ---
(In reply to Richard Biener from comment #11)
> We are putting quite heavy register-pressure on the thing by means of
> partial redundancy elimination, thus disabling PRE using -fno-tree-pre
> might help (we still spill a lot).

It looks like -fno-tree-pre or equivalent was implied in the options I was using, which were "-O2 -fomit-frame-pointer -Os -funroll-loops -finline-functions" - yes, with -Os added after -O2 when compiling this specific source file. IIRC, this was experimentally derived as producing best performance with 4.6.x or older. Adding -fno-tree-pre after all of these options merely changes the label names in the generated assembly code, while resulting in identical object files (and obviously no performance change).

Also, I now realize -Os was probably the reason why GCC preferred SSE "floating-point" bitwise ops and MOVs here, instead of SSE2's integer ones (they have longer encodings). Omitting -Os results in usage of the SSE2 instructions (both bitwise and MOVs), with correspondingly larger code. And yes, when I omit -Os, I do need to add -fno-tree-pre to regain roughly the same performance, and then to s/movdqu/movdqa/g to regain almost the full speed (movdqu is just as slow as movups on this CPU). I've just tested all of this with GCC 4.8.4 to possibly match yours (you mentioned you used 4.8).

So I think you uncovered yet another performance regression I had already worked around with -Os.

FWIW, here are the generated assembly code sizes ("wc" output) with GCC 4.8.4:

-O2 -fomit-frame-pointer -Os -funroll-loops -finline-functions
  5870  17420 137636 1.s
-O2 -fomit-frame-pointer -Os -funroll-loops -finline-functions -fno-tree-pre
  5870  17420 137636 2.s
-O2 -fomit-frame-pointer -funroll-loops -finline-functions
  6814  20193 156837 a.s
-O2 -fomit-frame-pointer -funroll-loops -finline-functions -fno-tree-pre
  6028  17842 138284 b.s

As you can see, -fno-tree-pre reduces the size almost to the -Os level. (But the .text size would be significantly larger because of the SSE2 instruction encodings. This is why I show the assembly code sizes for this comparison.)
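
For anyone wanting to repeat the s/movdqu/movdqa/g step mentioned above on their own compiler output, here is a minimal sketch in Python. The function name and the sample assembly are hypothetical and are not taken from the actual source file discussed in this bug. The substitution is only safe when every affected memory operand is known to be 16-byte aligned, since movdqa faults on unaligned addresses.

```python
# Hypothetical helper: equivalent of `s/movdqu/movdqa/g` applied to
# compiler-generated assembly text. Only valid if all touched memory
# operands are 16-byte aligned; movdqa faults otherwise.
def force_aligned_moves(asm_text: str) -> str:
    return asm_text.replace("movdqu", "movdqa")


# Made-up AT&T-syntax fragment, for illustration only.
sample = (
    "\tmovdqu\t(%rdi), %xmm0\n"
    "\tpxor\t%xmm1, %xmm0\n"
    "\tmovdqu\t%xmm0, (%rsi)\n"
)

patched = force_aligned_moves(sample)
print(patched)
```

In practice one would read the .s file produced with -S, apply the substitution, and assemble the result; that file handling is left out here to keep the sketch short.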