https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114814

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |fxue at os dot amperecomputing.com

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
The issue is the high VF imposed on us and the required bool -> size_t
conversion.  What you get is of course massive parallelism.

What hurts as well is the "linear" accumulation of the vector IVs, instead
of having multiple accumulators or accumulating them in a tree.  I think
there is work in progress to improve that part.

Using a widen-sum for part of the accumulation might be another
improvement.  Currently we fail here because a QI -> DI widen-sum isn't
available, but either an SI -> DI widen-sum with earlier QI -> SI widening
or a QI -> HI widen-sum with later HI -> DI widening would be possible.
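
To illustrate the points above, here is a scalar C sketch of the accumulation shapes the comment mentions. It is not the testcase from this PR and not what the vectorizer emits; the function names and the 4-way split are hypothetical, chosen only for illustration. The first function has the bool -> size_t reduction with a single accumulator, the second uses several independent accumulators combined in a small tree, and the third sums into a 16-bit partial before widening, loosely mirroring a QI -> HI widen-sum followed by HI -> DI widening.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline: every comparison yields a bool that is widened to size_t
   and added to one accumulator, forming a single dependency chain. */
size_t count_matches(const uint8_t *data, size_t n, uint8_t key)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (data[i] == key);
    return count;
}

/* Hypothetical 4-way split: independent accumulators shorten the
   dependency chain; they are combined pairwise (a tree) at the end. */
size_t count_matches_multi(const uint8_t *data, size_t n, uint8_t key)
{
    size_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += (data[i]     == key);
        a1 += (data[i + 1] == key);
        a2 += (data[i + 2] == key);
        a3 += (data[i + 3] == key);
    }
    size_t count = (a0 + a1) + (a2 + a3);
    for (; i < n; i++)              /* remaining 0..3 elements */
        count += (data[i] == key);
    return count;
}

/* Staged widening: sum 8-bit compare results into a 16-bit partial,
   then widen the partial to size_t.  Blocks are capped at UINT16_MAX
   elements so the partial cannot overflow. */
size_t count_matches_staged(const uint8_t *data, size_t n, uint8_t key)
{
    size_t total = 0;
    size_t i = 0;
    while (i < n) {
        size_t block = (n - i < UINT16_MAX) ? n - i : UINT16_MAX;
        uint16_t partial = 0;
        for (size_t j = 0; j < block; j++)
            partial += (data[i + j] == key);
        total += partial;
        i += block;
    }
    return total;
}
```

All three functions return the same count; they differ only in how the partial sums are carried, which is the part of the vectorized code the comment identifies as costly.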