https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88767
--- Comment #5 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
From the original reporter:

Partially unrolling the outermost loop in the innermost loop body enables data reuse for array A (see source), thereby improving the mem-ops/compute ratio and providing the performance gain.
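
For readers without the attached source at hand, the transformation described here (commonly called unroll-and-jam) can be sketched roughly as follows. This is only an illustration under stated assumptions: the loop nest, the array shapes, and the names baseline, unroll_and_jam and results_match are hypothetical and are not taken from the test case in this PR. The only property borrowed from the report is that A is not indexed by the outermost loop variable, so one load of A can serve two outer iterations once they are fused into the innermost body.

```c
/* Hypothetical sizes; N is kept even so the unroll-by-2 needs no remainder loop. */
#define N 4   /* outermost trip count */
#define M 3
#define K 5

/* Baseline nest: a[j][k] is reloaded for every value of i. */
static void baseline(double c[N][M], double a[M][K], double b[N][K])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            for (int k = 0; k < K; k++)
                c[i][j] += a[j][k] * b[i][k];
}

/* Outermost loop unrolled by 2, with both copies fused ("jammed") into
 * the innermost body: each a[j][k] is loaded once and used twice. */
static void unroll_and_jam(double c[N][M], double a[M][K], double b[N][K])
{
    for (int i = 0; i < N; i += 2)
        for (int j = 0; j < M; j++)
            for (int k = 0; k < K; k++) {
                double a_jk = a[j][k];            /* shared load of A */
                c[i][j]     += a_jk * b[i][k];
                c[i + 1][j] += a_jk * b[i + 1][k];
            }
}

/* Returns 1 if both versions produce identical results on small
 * integer-valued inputs, 0 otherwise. */
int results_match(void)
{
    double a[M][K], b[N][K];
    double c1[N][M] = {{0}}, c2[N][M] = {{0}};

    for (int j = 0; j < M; j++)
        for (int k = 0; k < K; k++)
            a[j][k] = j + 2 * k + 1;
    for (int i = 0; i < N; i++)
        for (int k = 0; k < K; k++)
            b[i][k] = 3 * i - k;

    baseline(c1, a, b);
    unroll_and_jam(c2, a, b);

    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            if (c1[i][j] != c2[i][j])
                return 0;
    return 1;
}
```

In this sketch the baseline innermost body does two loads (A and B) per multiply-add, while the jammed body does three loads for two multiply-adds, which is the kind of mem-ops/compute improvement the reporter refers to. The per-element accumulation order over k is unchanged, so the two versions are expected to agree exactly.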