https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116140

--- Comment #4 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
It looks like it's because the old unrolled code for the pointer version did a
subtract and used the difference to reduce the IV (induction variable) check
to once every 4 elements.  This explains the increase in instruction count.
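
For readers unfamiliar with the pattern, here is a minimal sketch of a
manually unrolled search of that kind.  It is an illustration only, not the
actual library source; the name find_if_by_four and the raw-pointer signature
are hypothetical.

```cpp
#include <cstddef>

// Hypothetical illustration; not the real library implementation.
template <typename T, typename Pred>
const T* find_if_by_four(const T* first, const T* last, Pred pred)
{
  // One subtraction gives the number of complete groups of 4 elements,
  // so the loop-bound check runs once per group, not once per element.
  std::ptrdiff_t groups = (last - first) / 4;
  for (; groups > 0; --groups) {
    if (pred(*first)) return first;
    ++first;
    if (pred(*first)) return first;
    ++first;
    if (pred(*first)) return first;
    ++first;
    if (pred(*first)) return first;
    ++first;
  }
  // Handle the 0 to 3 leftover elements one at a time.
  for (; first != last; ++first)
    if (pred(*first)) return first;
  return last;
}
```

The predicate is still evaluated for every element; only the bound check is
amortized over each group of 4.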

I hadn't noticed it during benchmarking because on aarch64 the non-pointer
version got recovered with cbz.

This should be fixable while still being vectorizable with

#pragma GCC unroll 4

on the loop.  The generated code looks good, but it looks like the pragma is
being dropped when used in the template.
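
A minimal sketch of what placing the pragma on a templated search loop could
look like.  The function name find_if_unrolled is hypothetical and this is
not the actual patch; it only shows where the pragma goes.  Whether the
compiler honours it inside a template depends on the GCC version in use.

```cpp
// Hypothetical illustration; not the real library code or the proposed fix.
template <typename Iter, typename Pred>
Iter find_if_unrolled(Iter first, Iter last, Pred pred)
{
  // The pragma must directly precede the loop it applies to.
  // It is a request to GCC to unroll the loop 4 times.
#pragma GCC unroll 4
  for (; first != last; ++first)
    if (pred(*first))
      return first;
  return last;
}
```

The source loop stays a simple one-element-per-iteration loop, and the
unrolling is left to the compiler.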

I'm away for a few days so Alex is looking into it.
