https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79745
--- Comment #2 from Venkataramanan <venkataramanan.kumar at amd dot com> ---
I checked -mprefer-avx128 vs -mno-prefer-avx256.

With AVX256, the generated assembly has 32 inserts and 28 packs, loading each char-type element individually to form the YMM vectors. Instead, loading 128 bits from memory and doing one insert into the upper half of the YMM register helps:

        vmovups (%rdi), %xmm0
        vmovups (%rdx), %xmm1
        incl    %eax
        addq    %rsi, %rdi
        addq    %rcx, %rdx
        vmovups (%rdi), %xmm4
        vmovups (%rdx), %xmm5
        vinserti128     $0x1, %xmm4, %ymm0, %ymm3
        vinserti128     $0x1, %xmm5, %ymm1, %ymm1
        vpsadbw %ymm1, %ymm3, %ymm3
        vpaddd  %ymm3, %ymm2, %ymm2
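
For reference, here is a minimal sketch of the preferred sequence written with AVX2 intrinsics. This is an illustrative SAD-style kernel, not the exact test case from this PR; the function name, strides, and row count are assumptions. It mirrors the assembly above: two 128-bit row loads combined into one YMM register via vinserti128, then vpsadbw/vpaddd to accumulate.

        #include <immintrin.h>

        /* Illustrative sketch (names and parameters assumed, not from the PR):
           sum of absolute differences over two strided byte buffers,
           16 bytes per row, two rows per iteration.
           Requires AVX2 (compile with -mavx2). */
        static unsigned sad_rows(const unsigned char *a, long stride_a,
                                 const unsigned char *b, long stride_b,
                                 int rows)
        {
            __m256i acc = _mm256_setzero_si256();
            for (int i = 0; i < rows; i += 2) {
                /* Two 128-bit loads per buffer, merged into one YMM register
                   with a single vinserti128, as in the assembly above. */
                __m256i va = _mm256_inserti128_si256(
                    _mm256_castsi128_si256(
                        _mm_loadu_si128((const __m128i *)a)),
                    _mm_loadu_si128((const __m128i *)(a + stride_a)), 1);
                __m256i vb = _mm256_inserti128_si256(
                    _mm256_castsi128_si256(
                        _mm_loadu_si128((const __m128i *)b)),
                    _mm_loadu_si128((const __m128i *)(b + stride_b)), 1);
                /* vpsadbw: per-64-bit-lane sums of absolute byte differences,
                   accumulated with vpaddd as in the assembly. */
                acc = _mm256_add_epi32(acc, _mm256_sad_epu8(va, vb));
                a += 2 * stride_a;
                b += 2 * stride_b;
            }
            /* Horizontal reduction of the four 64-bit partial sums. */
            unsigned long long t[4];
            _mm256_storeu_si256((__m256i *)t, acc);
            return (unsigned)(t[0] + t[1] + t[2] + t[3]);
        }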