https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118984
Peter Cordes <pcordes at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |pcordes at gmail dot com

--- Comment #3 from Peter Cordes <pcordes at gmail dot com> ---
Test-case with a loop that inlines the function: https://godbolt.org/z/Tcb3GGvYf

I was also hoping that it was only a problem with hard-register constraints, but unfortunately it's a real problem.

I'm sure it would be easy to construct a test case that's front-end bottlenecked (with a mix of ALU work and L1-hit loads and stores), so that the extra uop from even an eliminated vmovdqa does matter. That's especially easy on Skylake or older (4-wide pipeline), but probably doable for ICL or Alder Lake too. Or on AMD, use a mix of load, SIMD-ALU, scalar-ALU, and store uops, since it has separate SIMD schedulers and execution ports.

Using fewer ROB entries for the same work also lets out-of-order exec see farther ahead. And it lets the loop branch execute farther ahead of the rest of the loop body, if it's a simple counter and doesn't have competition for port 6 on Intel. That allows loop-exit branch-mispredict recovery to start earlier, while the loop body still has cycles of work queued up to keep the execution units busy, hiding the bubble before the work after the loop reaches the back-end.

So it certainly *can* matter. Often these are pretty minor effects for long-running loops, unless the loop actually is front-end bottlenecked, in which case an 8-uop loop body could be 12.5% or more faster than a 9-uop one. I don't know how often that happens in practice in real programs. Probably not very often.

It also matters for hyperthreading friendliness: needing fewer front-end cycles to issue the same amount of work leaves more for the other logical core.
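For illustration, a front-end-bottlenecked loop of that flavor could look something like the sketch below (plain scalar C; the function name and constants are invented here, and the actual test case is the one in the Godbolt link above). The idea is just to spread the per-iteration work across load, store, and scalar-ALU uops so no single execution port saturates and throughput is set by issue width:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch, not the test case from the bug: each iteration
 * mixes an L1-hit load, an ALU op + store, and extra scalar ALU work,
 * so the limit is front-end issue bandwidth rather than any one port.
 * On a 4-wide front end, an 8-uop loop body issues in 2 cycles per
 * iteration; one extra uop (e.g. a vmovdqa that still costs an issue
 * slot even when move-eliminated) pushes it to 9 uops, i.e. ~12.5%
 * slower if nothing else is the bottleneck. */
uint64_t mix_work(int32_t *dst, const int32_t *src, size_t n)
{
    uint64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        int32_t v = src[i];      /* load (L1 hit for a small array)   */
        dst[i] = v * 3;          /* ALU + store                       */
        acc += (uint64_t)v ^ i;  /* more scalar ALU to fill the width */
    }
    return acc;
}
```

Counting uops in the compiled asm (e.g. with perf or llvm-mca) would show whether the redundant move actually changes iterations-per-cycle on a given core.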