https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #12 from ktkachov at gcc dot gnu.org ---
(In reply to ktkachov from comment #11)
>
> As an experiment I hacked the AArch64 assembly of the function generated
> with -funroll-loops to replace the peeled prologue version with a simple
> non-unrolled loop. That gave a sizeable speedup on two AArch64 platforms:
> >7%.
>
> So beyond the vectorisation point Richard S. made above, maybe it's worth
> considering replacing the peeled prologue with a simple loop instead?
> Or at least add that as a distinct unrolling strategy and work to come up
> with an analysis that would allow us to choose one over the other?

Upon reflection I think I may have bungled up the assembly hacking (the
changes I made may not be equivalent to the source). I'll redo that
experiment soon, so please disregard that part for now. The iteration count
distribution numbers are still valid though.