https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84712
--- Comment #1 from Marc Glisse <glisse at gcc dot gnu.org> ---
We unroll quite late (cunroll), and there aren't any passes (like FRE) after that to do the propagation. Adding #pragma GCC unroll 16 before the loop lets it optimize.
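
For illustration, here is a minimal sketch of where the pragma goes. The test case from the report is not reproduced here, so the function sum_table and its 16-element array are hypothetical stand-ins. The pragma has to sit immediately before the loop it applies to and needs GCC 8 or later. Whether the whole function then folds to a constant depends on the GCC version and flags, so it is worth checking the output of -O2 -S rather than assuming it.

```c
/* Hypothetical example, not the test case from the bug report. */
int sum_table(void)
{
    int table[16];

    /* Request unrolling of this 16-iteration loop by a factor of 16. */
#pragma GCC unroll 16
    for (int i = 0; i < 16; ++i)
        table[i] = i * 2;

    int sum = 0;

    /* Same request for the loop that reads the values back. */
#pragma GCC unroll 16
    for (int i = 0; i < 16; ++i)
        sum += table[i];

    return sum;
}
```

The pragma does not change the result: sum_table() returns 240 with or without it. Only the generated code differs.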