https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89049

--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
With -mtune=core-avx2 we do

        vmovups (%rdi), %xmm1
        vmovups (%rdi), %ymm3
...
        vextractf128    $0x1, %ymm3, %xmm1

and with -mtune=intel the even weirder

        vmovups (%rdi), %xmm1
        addq    $32, %rdi
        vmovups -32(%rdi), %ymm3
...
        vextractf128    $0x1, %ymm3, %xmm1

I guess at runtime the vectorized variant isn't much worse, apart from
the loop size growth.  So an additional "weight" we could put into the
generic vectorizer cost metric would be the number of stmts generated,
that is, computing an effective unroll factor and applying unroll limits
to that.  In this case we'd do 8-times unrolling (the resulting loop body
is twice as large as 8-times unrolled scalar code).
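
To make the idea concrete, here is a rough sketch of such a check.  It
is not existing vectorizer code: the function names, the statement
counts and the `max_unroll` parameter are all hypothetical, and taking
the effective unroll factor to be the ratio of generated vector stmts to
scalar stmts per iteration is only one possible reading of the above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not a GCC function: the number of statements in
   the vectorized loop body divided by the number of statements in one
   scalar iteration, rounded up.  */
static unsigned
effective_unroll_factor (unsigned vector_stmts, unsigned scalar_stmts)
{
  assert (scalar_stmts > 0);
  return (vector_stmts + scalar_stmts - 1) / scalar_stmts;
}

/* Hypothetical cost check: accept vectorization only if the effective
   unroll factor stays within an assumed unroll limit.  */
static bool
vectorization_within_unroll_limit (unsigned vector_stmts,
                                   unsigned scalar_stmts,
                                   unsigned max_unroll)
{
  return effective_unroll_factor (vector_stmts, scalar_stmts) <= max_unroll;
}
```

With made-up numbers, a scalar body of 10 stmts that vectorizes into 160
stmts (twice the size of the 8-times unrolled scalar body) gives an
effective unroll factor of 16, so a limit of 8 would reject it.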
