https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87077

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
We are now vectorizing the outer loop with the inner loop being unrolled.

If you add #pragma GCC unroll 0 to the inner loop we get comparatively good
code, but we reduce to scalar 4 times.

If you add #pragma GCC unroll 4 to both loops we apply BB vectorization,
which expands the reductions in a suboptimal way. The reductions are now
also detected, but they are covered by the BB vectorization we recognize
for the store of the reduction results.
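
For reference, a minimal sketch of the two pragma placements discussed
above. The kernel is a hypothetical 4x4 matrix-vector product, not the
testcase attached to this PR, and the function names are made up for
illustration. Per the GCC manual, #pragma GCC unroll must immediately
precede the loop it applies to, and a value of 0 or 1 blocks unrolling.

```c
/* Hypothetical kernel (assumption, not the PR's testcase):
   out = m * v for a row-major 4x4 matrix m.  */

/* Variant 1: inner loop kept rolled via "unroll 0".  */
void mvmul4_inner_rolled(const float *m, const float *v, float *out)
{
    for (int i = 0; i < 4; ++i) {
        float sum = 0.0f;
        /* 0 (or 1) tells GCC not to unroll the loop that follows.  */
        #pragma GCC unroll 0
        for (int j = 0; j < 4; ++j)
            sum += m[i * 4 + j] * v[j];
        out[i] = sum;   /* store of the reduction result */
    }
}

/* Variant 2: both loops requested to be unrolled 4 times, which leaves
   straight-line code for the basic-block (BB) vectorizer.  */
void mvmul4_both_unrolled(const float *m, const float *v, float *out)
{
    #pragma GCC unroll 4
    for (int i = 0; i < 4; ++i) {
        float sum = 0.0f;
        #pragma GCC unroll 4
        for (int j = 0; j < 4; ++j)
            sum += m[i * 4 + j] * v[j];
        out[i] = sum;   /* store of the reduction result */
    }
}
```

Both variants compute the same result; only the generated code differs.
Comparing the assembly (for example with -O3 -S) or the vectorizer dump
(-fopt-info-vec) for each function should show which path was taken,
though the exact output depends on the GCC version and target flags.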

Note haddp[sd] is slow.
