https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79745

--- Comment #2 from Venkataramanan <venkataramanan.kumar at amd dot com> ---
I checked -mprefer-avx128 vs -mno-prefer-avx256. 

With AVX256, the generated assembly uses 32 inserts and 28 packs, loading each
char-type element individually to form the YMM vectors.

Instead, loading 128 bits from memory and doing a single insert into the upper
half of the YMM register helps:

        vmovups (%rdi), %xmm0                        # 128-bit load, first buffer
        vmovups (%rdx), %xmm1                        # 128-bit load, second buffer
        incl    %eax                                 # loop counter
        addq    %rsi, %rdi                           # advance first pointer by stride
        addq    %rcx, %rdx                           # advance second pointer by stride
        vmovups (%rdi), %xmm4                        # next 128-bit load, first buffer
        vmovups (%rdx), %xmm5                        # next 128-bit load, second buffer
        vinserti128     $0x1, %xmm4, %ymm0, %ymm3    # one insert into upper half
        vinserti128     $0x1, %xmm5, %ymm1, %ymm1
        vpsadbw %ymm1, %ymm3, %ymm3                  # packed sums of absolute differences
        vpaddd  %ymm3, %ymm2, %ymm2                  # accumulate
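
For reference, a minimal C intrinsics sketch of the same pattern, assuming a
sum-of-absolute-differences kernel over two strided byte buffers (the function
name, signature, and reduction tail are illustrative, not the testcase from
this bug; compile with -mavx2):

#include <immintrin.h>
#include <stdint.h>

/* Two 16-byte rows per operand are combined into one YMM register with a
 * single vinserti128, then vpsadbw/vpaddd accumulate, matching the assembly
 * above.  Assumes rows is even. */
static int sad_avx2(const uint8_t *a, long a_stride,
                    const uint8_t *b, long b_stride, int rows)
{
    __m256i acc = _mm256_setzero_si256();
    for (int i = 0; i < rows; i += 2) {
        /* 128-bit loads for two consecutive rows of each buffer. */
        __m128i a0 = _mm_loadu_si128((const __m128i *)a);
        __m128i b0 = _mm_loadu_si128((const __m128i *)b);
        a += a_stride;
        b += b_stride;
        __m128i a1 = _mm_loadu_si128((const __m128i *)a);
        __m128i b1 = _mm_loadu_si128((const __m128i *)b);
        a += a_stride;
        b += b_stride;
        /* One insert into the upper half instead of per-element inserts/packs. */
        __m256i va = _mm256_inserti128_si256(_mm256_castsi128_si256(a0), a1, 1);
        __m256i vb = _mm256_inserti128_si256(_mm256_castsi128_si256(b0), b1, 1);
        /* vpsadbw: absolute byte differences summed into 4 x 64-bit partials. */
        acc = _mm256_add_epi32(acc, _mm256_sad_epu8(va, vb));
    }
    /* Horizontal reduction of the four partial sums. */
    __m128i lo = _mm256_castsi256_si128(acc);
    __m128i hi = _mm256_extracti128_si256(acc, 1);
    __m128i s  = _mm_add_epi32(lo, hi);
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    return _mm_cvtsi128_si32(s);
}

Each iteration issues two 128-bit loads per operand and one vinserti128,
rather than per-element inserts and packs.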
