https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #12 from ktkachov at gcc dot gnu.org ---
(In reply to ktkachov from comment #11)
> 
> As an experiment I hacked the AArch64 assembly of the function generated
> with -funroll-loops to replace the peeled prologue version with a simple
> non-unrolled loop. That gave a sizeable speedup on two AArch64 platforms:
> >7%.
> 
> So beyond the vectorisation point Richard S. made above, maybe it's worth
> considering replacing the peeled prologue with a simple loop instead?
> Or at least add that as a distinct unrolling strategy and work to come up
> with an analysis that would allow us to choose one over the other?

Upon reflection I think I may have bungled the assembly hacking (the changes
I made may not be equivalent to the source). I'll redo that experiment soon, so
please disregard the >7% speedup claim for now. The iteration count distribution
numbers are still valid though.
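
For anyone following along, here is a rough C-level sketch of the two prologue
shapes under discussion. It is not the code from the benchmark: the summation
kernel, the unroll factor of 4 and both function names are made up for
illustration, and the peeled form only approximates what the unroller emits
as far as I understand it.

```c
#include <stddef.h>

/* Hypothetical kernel: sum n ints with the main loop unrolled by 4.
   Both functions compute the same result; only the prologue differs. */

/* Strategy A: peeled prologue. The n % 4 leftover iterations run as
   straight-line copies of the loop body, each guarded by a branch. */
long sum_peeled_prologue(const int *a, size_t n)
{
    long s = 0;
    size_t i = 0;
    size_t rem = n % 4;

    if (rem >= 1) s += a[i++];
    if (rem >= 2) s += a[i++];
    if (rem >= 3) s += a[i++];

    /* n - i is now a multiple of 4. */
    for (; i < n; i += 4)
        s += (long)a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    return s;
}

/* Strategy B: simple loop prologue. The leftover iterations run in a
   plain non-unrolled loop before entering the unrolled main loop. */
long sum_loop_prologue(const int *a, size_t n)
{
    long s = 0;
    size_t i = 0;
    size_t rem = n % 4;

    for (; i < rem; i++)
        s += a[i];

    /* n - i is now a multiple of 4. */
    for (; i < n; i += 4)
        s += (long)a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    return s;
}
```

The trade-off is code size and branch count in the prologue (A) against an
extra small loop (B). Which one wins presumably depends on the iteration count
distribution.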
