https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017

--- Comment #13 from Alexander Peslyak <solar-gcc at openwall dot com> ---
(In reply to Richard Biener from comment #11)
> We are putting quite heavy register-pressure on the thing by means of
> partial redundancy elimination, thus disabling PRE using -fno-tree-pre
> might help (we still spill a lot).

It looks like -fno-tree-pre or equivalent was implied in the options I was
using, which were "-O2 -fomit-frame-pointer -Os -funroll-loops
-finline-functions" - yes, with -Os added after -O2 when compiling this
specific source file.  IIRC, this was experimentally derived as producing best
performance with 4.6.x or older.  Adding -fno-tree-pre after all of these
options merely changes the label names in the generated assembly code, while
resulting in identical object files (and obviously no performance change). 
Also, I now realize -Os was probably the reason why GCC preferred SSE
"floating-point" bitwise ops and MOVs here, instead of SSE2's integer ones
(they have longer encodings). Omitting -Os results in usage of the SSE2
instructions (both bitwise and MOVs), with correspondingly larger code. And
yes, when I omit -Os, I do need to add -fno-tree-pre to regain roughly the same
performance, and then to s/movdqu/movdqa/g to regain almost the full speed
(movdqu is just as slow as movups on this CPU). I've just tested all of this
with GCC 4.8.4 to possibly match yours (you mentioned you used 4.8). So I think
you uncovered yet another performance regression I had already worked around
with -Os.

FWIW, here are the generated assembly code sizes ("wc" output) with GCC 4.8.4:

-O2 -fomit-frame-pointer -Os -funroll-loops -finline-functions
  5870  17420 137636 1.s
-O2 -fomit-frame-pointer -Os -funroll-loops -finline-functions -fno-tree-pre
  5870  17420 137636 2.s
-O2 -fomit-frame-pointer -funroll-loops -finline-functions
  6814  20193 156837 a.s
-O2 -fomit-frame-pointer -funroll-loops -finline-functions -fno-tree-pre
  6028  17842 138284 b.s

As you can see, -fno-tree-pre reduces the size almost to the -Os level. (But
the .text size would be significantly larger because of the SSE2 instruction
encodings.  This is why I show the assembly code sizes for this comparison.)

Reply via email to