https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116140
--- Comment #4 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
It looks like this is because the old unrolled code for the pointer version did a subtract and used the difference to cut the IV check down to once every 4 elements. This explains the increase in instruction count.

I hadn't noticed it during benchmarking because on aarch64 the non-pointer version got recovered with cbz.

This should be fixable, while still being vectorizable, with #pragma GCC unroll 4 on the loop. The generated code looks good, but it looks like the pragma is being dropped when used in the template.

I'm away for a few days, so Alex is looking into it.
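
For readers following along, here is a rough sketch of the two loop shapes discussed in the comment above. It is not the library's actual code: the function names find_manual_unroll and find_pragma_unroll are hypothetical, and the first function only approximates the kind of hand-unrolled pointer loop described (one subtraction up front, counter tested once per four elements). The second is a plain loop in a template carrying the #pragma GCC unroll 4 request, which is the form reported as losing the pragma.

```cpp
#include <cstddef>

// Hand-unrolled form (illustrative only, hypothetical name): one pointer
// subtraction up front, then the group counter is tested once per four
// elements instead of once per element.
template <typename T>
const T* find_manual_unroll(const T* first, const T* last, const T& value)
{
  std::ptrdiff_t groups = (last - first) / 4;
  for (; groups > 0; --groups)
    {
      if (first[0] == value) return first;
      if (first[1] == value) return first + 1;
      if (first[2] == value) return first + 2;
      if (first[3] == value) return first + 3;
      first += 4;
    }
  // Fewer than four elements remain; check them one at a time.
  for (; first != last; ++first)
    if (*first == value)
      return first;
  return last;
}

// Plain loop with an unroll request (hypothetical name). The pragma is a
// GCC extension and must immediately precede the loop; it is a request to
// the optimizer, and other compilers may ignore it.
template <typename Iter, typename T>
Iter find_pragma_unroll(Iter first, Iter last, const T& value)
{
#pragma GCC unroll 4
  for (; first != last; ++first)
    if (*first == value)
      return first;
  return last;
}
```

Both functions return the same result for the same input; only the loop structure presented to the optimizer differs. Whether the pragma survives template instantiation is the open question in this report, so comparing the generated assembly for an instantiation of find_pragma_unroll against a non-template copy of the same loop would be one way to check it.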