https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
With -mtune=core-avx2 we do
vmovups (%rdi), %xmm1
vmovups (%rdi), %ymm3
...
vextractf128 $0x1, %ymm3, %xmm1
and with -mtune=intel the even weirder
vmovups (%rdi), %xmm1
addq $32, %rdi
vmovups -32(%rdi), %ymm3
...
vextractf128 $0x1, %ymm3, %xmm1
I guess at runtime the vectorized variant isn't that much worse, apart
from the loop size growth. So an additional "weight" we could
put into the generic vectorizer cost metric would be the number of
stmts generated - that is, computing an effective unroll factor and
applying unroll limits to that. In this case we'd do 8-times unrolling
(resulting loop body is twice as large compared to 8-unrolled scalar code).
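
To make the idea concrete, here is a rough sketch of such a check in C. It is not GCC code: struct loop_cost, effective_unroll_factor and within_unroll_limits are hypothetical names, and the limits are plain arguments rather than actual GCC params. It only assumes we can count the stmts in one scalar iteration and in the vectorized loop body.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical cost inputs, not a GCC data structure. */
struct loop_cost {
    unsigned scalar_stmts;  /* stmts in one scalar loop iteration */
    unsigned vector_stmts;  /* stmts in the vectorized loop body */
};

/* Effective unroll factor: how many scalar iterations' worth of stmts
   the vectorized body contains, rounded up. */
static unsigned effective_unroll_factor(struct loop_cost c)
{
    assert(c.scalar_stmts > 0);
    return (c.vector_stmts + c.scalar_stmts - 1) / c.scalar_stmts;
}

/* Apply unroll-style limits to the vectorized body: reject it when the
   effective unroll factor or the absolute body size exceeds the limit.
   Both limits are caller-supplied assumptions. */
static bool within_unroll_limits(struct loop_cost c,
                                 unsigned max_times,
                                 unsigned max_stmts)
{
    return effective_unroll_factor(c) <= max_times
           && c.vector_stmts <= max_stmts;
}
```

With made-up numbers, a loop with 4 scalar stmts whose vector body is twice the size of the 8-unrolled scalar code has 64 stmts, so the effective unroll factor is 16 and a limit of 8 unroll times would reject it.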