https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #22 from Wilco <wdijkstr at arm dot com> ---
(In reply to Evandro from comment #21)
> (In reply to ramana.radhakrish...@arm.com from comment #20)
> > What's the kind of performance delta you see if you managed to unroll
> > the loop just a wee bit ? Probably not much looking at the code produced
> > here.
>
> Comparing the cycle counts on Juno when running the program from the matrix
> multiplication test above built with -Ofast and unrolling:
>
> -fno-unroll-loops: 592000
> -funroll-loops --param max-unroll-times=2: 594000
> -funroll-loops --param max-unroll-times=4: 592000
> -funroll-loops: 590000 (implies --param max-unroll-times=8)
> -funroll-loops --param max-unroll-times=16: 581000
>
> It seems to me that without effective iv-opt in place, loops have to be
> unrolled too aggressively to make any difference in this case, greatly
> sacrificing code size.

Unrolling alone isn't good enough in sum reductions. As I mentioned before, GCC
doesn't enable any of the useful loop optimizations by default. So add
-fvariable-expansion-in-unroller to get a good speedup with unrolling. Again
these are all generic GCC issues.
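
To illustrate why unrolling alone does little for a sum reduction, here is a
hand-written C sketch of the idea behind variable expansion. It is not GCC's
actual output and not the test case from this bug; the function names
(dot_simple, dot_unrolled, dot_expanded) and the unroll factor of 4 are
assumptions made up for the example.

```c
#include <stddef.h>

/* Plain sum reduction: each add depends on the previous one through `sum`. */
double dot_simple(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Unrolled by 4 with a single accumulator. Loop overhead drops, but all
   four adds still form one serial dependency chain through `sum`. */
double dot_unrolled(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        sum += a[i]     * b[i];
        sum += a[i + 1] * b[i + 1];
        sum += a[i + 2] * b[i + 2];
        sum += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)          /* remainder iterations */
        sum += a[i] * b[i];
    return sum;
}

/* Unrolled by 4 with the accumulator split into four copies, roughly the
   effect variable expansion aims for. The four chains are independent and
   are only combined after the loop. */
double dot_expanded(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    double sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)          /* remainder iterations */
        sum += a[i] * b[i];
    return sum;
}
```

Splitting the accumulator reorders the floating-point additions, so in general
the result can differ in rounding from the simple loop. A compiler may only do
this itself when reassociation is permitted, as it is with the -Ofast setting
used in the measurements quoted above.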