https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100267
--- Comment #5 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Hongtao.liu from comment #4)
> (In reply to Hongtao.liu from comment #3)
> > After supporting v{,p}expand* without mask operands, codegen seems to be
> > optimal.
> 
> I was wrong; without a mask, it's just a simple move.

Finally optimized to:

_Z16dummyf1_avx512x8PK11flow_avx512:
.LFB5665:
        .cfi_startproc
        movl    (%rdi), %edx
        movq    8(%rdi), %rax
        vmovdqu (%rax,%rdx,8), %ymm0
        vmovdqu 32(%rax,%rdx,8), %ymm1
        vpaddq  %ymm1, %ymm0, %ymm0
        ret

I'm testing the patch.
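
For reference, a minimal sketch of a source shape consistent with the codegen above (my reconstruction, not the PR's actual testcase: the flow_avx512 layout and the use of _mm256_mask_expandloadu_epi64 with an all-ones mask are assumptions). With a full mask an expand load reorders nothing, so it can be lowered to a plain unaligned load (vmovdqu), leaving only the vpaddq:

#include <immintrin.h>

/* Hypothetical layout, inferred from the loads of (%rdi) and 8(%rdi) above;
   the real flow_avx512 definition is not shown in this comment.  */
struct flow_avx512
{
  int idx;                /* movl (%rdi), %edx  */
  const long long *data;  /* movq 8(%rdi), %rax */
};

__m256i
dummyf1_avx512x8 (const flow_avx512 *f)
{
  const long long *p = f->data + f->idx;
  /* All-ones mask (0x0f for 4 qword lanes): nothing is expanded, so the
     patch under test should be able to fold each load to vmovdqu.  */
  __m256i a = _mm256_mask_expandloadu_epi64 (_mm256_setzero_si256 (), 0x0f, p);
  __m256i b = _mm256_mask_expandloadu_epi64 (_mm256_setzero_si256 (), 0x0f, p + 4);
  return _mm256_add_epi64 (a, b);
}

(Compiled with something like -O2 -mavx512f -mavx512vl; whether the full-mask expand loads are actually folded to plain moves is exactly what the patch being tested addresses.)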