https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
--- Comment #5 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Wilco from comment #4)
> (In reply to ktkachov from comment #2)
> > Created attachment 45386 [details]
> > aarch64-llvm output with -Ofast -mcpu=cortex-a57
> >
> > I'm attaching the full LLVM aarch64 output.
> >
> > The output you quoted is with -funroll-loops. If that's not given, GCC
> > doesn't seem to unroll by default at all (on aarch64 or x86_64 from my
> > testing).
> >
> > Is there anything we can do to make the default unrolling a bit more
> > aggressive?
>
> I don't think the RTL unroller works at all. It doesn't have the right
> settings, and doesn't understand how to unroll, so we always get inefficient
> and bloated code.
>
> To do unrolling correctly it has to be integrated at tree level - for
> example when vectorization isn't possible/beneficial, unrolling might still
> be a good idea.

To add some numbers to the conversation, the gain LLVM gets from default
unrolling is 4.5% on SPECINT2017 and 1.0% on SPECFP2017.

This clearly shows there is huge potential from unrolling, *if* we can teach
GCC to unroll properly like LLVM. That means early unrolling, using good
default settings and using a trailing loop rather than inefficient peeling.
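
To make the "trailing loop" shape concrete, here is a hand-written C sketch of a
reduction unrolled by 4, with the leftover iterations handled by a small loop
after the main body. It is an illustration only, not output from GCC or LLVM:
the function names, the unroll factor of 4 and the use of separate accumulators
are all assumptions chosen for the example.

```c
#include <stddef.h>

/* Reference version: the plain rolled loop. */
long sum_rolled(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hypothetical hand-unrolled version (factor 4, chosen for illustration).
   The main loop handles full groups of four elements; a trailing loop
   then handles the remaining 0..3 elements. */
long sum_unrolled_trailing(const int *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    /* Main unrolled body: runs while at least four elements remain. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Combine the partial sums (exact for integers). */
    long s = (s0 + s1) + (s2 + s3);

    /* Trailing loop: one compact copy of the body for the remainder,
       instead of several peeled straight-line copies. */
    for (; i < n; i++)
        s += a[i];

    return s;
}
```

The trailing loop keeps a single extra copy of the loop body, whereas peeling
the remainder would emit up to three additional straight-line copies plus the
branches to select between them.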