https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63503
--- Comment #24 from Wilco <wdijkstr at arm dot com> ---
(In reply to Evandro from comment #23)
> (In reply to Wilco from comment #22)
> > Unrolling alone isn't good enough in sum reductions. As I mentioned before,
> > GCC doesn't enable any of the useful loop optimizations by default. So add
> > -fvariable-expansion-in-unroller to get a good speedup with unrolling. Again
> > these are all generic GCC issues.
>
> Adding -fvariable-expansion-in-unroller when using -funroll-loops results in
> practically the same code being emitted.

Correct, all it does is cut the dependency chain of the accumulates. But that's
enough to get the speedup.
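
To illustrate what cutting the dependency chain of the accumulates means, here is a hand-written C sketch. It is not the code GCC emits and not the test case from this bug: the function names sum_chain and sum_split and the choice of four accumulators are assumptions made for the example. The first function has one accumulator, so each add has to wait for the previous one even after unrolling. The second splits the sum across independent accumulators, which is roughly the shape variable expansion aims for.

```c
#include <stddef.h>

/* Single accumulator: every add depends on the result of the previous
   add, so unrolling alone still leaves one serial chain. */
double sum_chain(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hypothetical hand-expanded version: four independent accumulators,
   so the four adds in one iteration do not depend on each other. */
double sum_split(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;

    /* Main unrolled body, four elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Remainder when n is not a multiple of four. */
    for (; i < n; i++)
        s0 += a[i];

    /* Combine the partial sums once, after the loop. */
    return (s0 + s1) + (s2 + s3);
}
```

The loop bodies look almost identical instruction for instruction, which matches the observation that practically the same code is emitted. The difference is only which register each add accumulates into. Note that splitting a floating-point sum changes the order of the additions, so in general the result can differ in rounding from the single-accumulator version.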