https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79745
--- Comment #2 from Venkataramanan <venkataramanan.kumar at amd dot com> ---
I checked -mprefer-avx128 vs -mno-prefer-avx128.
With 256-bit vectorization, the generated assembly uses 32 inserts and 28 packs
to load each char-type element individually when building the YMM vectors.
Instead, doing two 128-bit loads from memory and one vinserti128 into the upper
half of the YMM register helps:
vmovups (%rdi), %xmm0
vmovups (%rdx), %xmm1
incl %eax
addq %rsi, %rdi
addq %rcx, %rdx
vmovups (%rdi), %xmm4
vmovups (%rdx), %xmm5
vinserti128 $0x1, %xmm4, %ymm0, %ymm3
vinserti128 $0x1, %xmm5, %ymm1, %ymm1
vpsadbw %ymm1, %ymm3, %ymm3
vpaddd %ymm3, %ymm2, %ymm2