https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
With -mtune=core-avx2 we do

        vmovups (%rdi), %xmm1
        vmovups (%rdi), %ymm3
...
        vextractf128    $0x1, %ymm3, %xmm1

with -mtune=intel the even more weird

        vmovups (%rdi), %xmm1
        addq    $32, %rdi
        vmovups -32(%rdi), %ymm3
...
        vextractf128    $0x1, %ymm3, %xmm1

I guess at runtime the vectorized variant isn't so much worse if not because
of the loop size growth.  So an additional "weight" we could put into the
generic vectorizer cost metric would be the number of stmts generated - that
is, computing an effective unroll factor and applying unroll limits to that.
In this case we'd do 8-times unrolling (resulting loop body is twice as large
compared to 8-unrolled scalar code).
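
To make the "effective unroll factor" idea concrete, here is a minimal sketch in C. It is not GCC code: the function names, the two limits, and the rounding-up definition of the factor (vector body stmts divided by scalar body stmts) are all assumptions made for illustration. Whether the vectorizer would reuse the existing unroller params or its own limits is left open here.

```c
#include <stdbool.h>

/* Hypothetical limits, chosen only for illustration.  They are not
   taken from GCC's actual unroller or vectorizer parameters.  */
enum { MAX_EFFECTIVE_UNROLL = 8, MAX_BODY_STMTS = 200 };

/* Effective unroll factor (assumed definition): how many copies of the
   scalar loop body the vectorized body corresponds to in size,
   rounded up.  Returns 0 for an empty scalar body.  */
static unsigned
effective_unroll_factor (unsigned scalar_stmts, unsigned vector_stmts)
{
  if (scalar_stmts == 0)
    return 0;
  return (vector_stmts + scalar_stmts - 1) / scalar_stmts;
}

/* Size-based gate: accept the vectorized loop only if both the
   effective unroll factor and the absolute body size stay within
   the (hypothetical) limits above.  */
static bool
vectorization_within_size_limits (unsigned scalar_stmts,
                                  unsigned vector_stmts)
{
  unsigned euf = effective_unroll_factor (scalar_stmts, vector_stmts);
  return euf <= MAX_EFFECTIVE_UNROLL && vector_stmts <= MAX_BODY_STMTS;
}
```

With a made-up scalar body of 4 stmts, an 8-unrolled scalar loop has 32 stmts. A vectorized body twice that size (64 stmts) gives an effective factor of 16 under this definition and would be rejected by the sketch, while a 32-stmt vector body would pass.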