https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87077
--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
We are now vectorizing the outer loop with the inner loop unrolled.

If you add #pragma GCC unroll 0 to the inner loop we get comparatively good code, but we reduce to scalar four times.

If you add #pragma GCC unroll 4 to both loops we apply BB vectorization, which expands the reductions in a suboptimal way. It now also detects the reductions, but they are covered by the BB vectorization we recognize for the store of the reduction results.

Note that haddp[sd] is slow.
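
For illustration, here is a minimal sketch of the two pragma placements mentioned above. It assumes a 4x4 matrix-vector product as the kernel; this is a hypothetical stand-in, not the testcase attached to this PR, and the function names are made up for the example.

```c
/* Hypothetical kernel for illustration only: a 4x4 matrix-vector product
   whose inner loop is a sum reduction.  Not the testcase from this PR. */

/* Variant 1: "#pragma GCC unroll 0" keeps the inner loop rolled, so the
   vectorizer sees an outer loop containing a real inner reduction loop. */
void matvec4_inner_rolled(const float *restrict m, const float *restrict v,
                          float *restrict out)
{
    for (int i = 0; i < 4; ++i) {
        float sum = 0.0f;
#pragma GCC unroll 0
        for (int j = 0; j < 4; ++j)
            sum += m[i * 4 + j] * v[j];
        out[i] = sum;
    }
}

/* Variant 2: "#pragma GCC unroll 4" on both loops asks for full unrolling,
   leaving straight-line code for basic-block (BB/SLP) vectorization. */
void matvec4_unroll4(const float *restrict m, const float *restrict v,
                     float *restrict out)
{
#pragma GCC unroll 4
    for (int i = 0; i < 4; ++i) {
        float sum = 0.0f;
#pragma GCC unroll 4
        for (int j = 0; j < 4; ++j)
            sum += m[i * 4 + j] * v[j];
        out[i] = sum;
    }
}
```

Both variants compute the same result; only the shape of the code handed to the vectorizer differs. Compiling with something like gcc -O3 -S and comparing the assembly should show the difference, though the exact output depends on the GCC version and target.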