https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118984

Peter Cordes <pcordes at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pcordes at gmail dot com

--- Comment #3 from Peter Cordes <pcordes at gmail dot com> ---
Test-case with a loop that inlines the function:
https://godbolt.org/z/Tcb3GGvYf
I was also hoping that it was only a problem with hard-register constraints,
but unfortunately it's a real problem.

I'm sure it would be easy to construct a test case that's front-end
bottlenecked (with a mix of ALU and L1-hit loads and stores) so the extra
front-end uop from a mov-eliminated vmovdqa does matter.  Especially on Skylake
or older (4-wide pipeline), but probably doable for ICL or Alder Lake too.
Or on AMD, a mix of loads, stores, SIMD-ALU, and scalar-ALU uops, since AMD has
separate SIMD and integer schedulers and execution ports.

Fewer ROB entries for the same work also lets out-of-order exec see farther
ahead.  It also lets the loop branch execute farther ahead of the rest of the
loop body (if it's a simple counter with no competition for port 6 on Intel),
so recovery from the loop-exit branch mispredict can start earlier, while the
loop body still has cycles of queued work to keep the execution units busy and
hide the bubble before the post-loop work reaches the back-end.

So it certainly *can* matter.  Often these are pretty minor effects for
long-running loops unless the loop really is front-end bottlenecked, in which
case an 8-uop loop instead of a 9-uop one could be 12.5% or more faster.
I don't know how often that happens in practice in real programs.  Probably
not very often.

Also for hyperthreading friendliness: needing fewer issue cycles for the same
amount of work leaves more front-end bandwidth for the other logical core.
