https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459
Andrew Senkevich <andrew.n.senkevich at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |andrew.n.senkevich at gmail dot com

--- Comment #2 from Andrew Senkevich <andrew.n.senkevich at gmail dot com> ---
Currently -mprefer-avx256 is default for SKX and vzeroupper addition was fixed,
code generated is:

.L3:
        vpsrlw  $8, (%rsi,%rax,2), %ymm0
        vpsrlw  $8, 32(%rsi,%rax,2), %ymm1
        vpand   %ymm0, %ymm2, %ymm0
        vpand   %ymm1, %ymm2, %ymm1
        vpackuswb       %ymm1, %ymm0, %ymm0
        vpermq  $216, %ymm0, %ymm0
        vmovdqu8        %ymm0, (%rdi,%rax)
        addq    $32, %rax
        cmpq    %rax, %rdx
        jne     .L3

vmovdqu8 remains but I cannot confirm it is slower.
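
For readers following the listing, a scalar C loop consistent with the .L3 body might look like the sketch below. It is inferred from the instructions alone and is not taken from the bug report: the name pack_high8 and its signature are hypothetical, and it assumes %ymm2 holds 0x00ff in every 16-bit lane (its setup is outside the quoted loop). Each iteration of the vector loop handles 32 such elements: two vpsrlw shift 16 words each right by 8, vpackuswb narrows them to bytes, and vpermq $216 restores element order across the two 128-bit lanes before the 32-byte store.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar equivalent of the .L3 loop: store the high byte of
 * each 16-bit source element. Name and signature are assumptions made for
 * illustration only. */
void pack_high8(uint8_t *restrict dst, const uint16_t *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* vpsrlw $8 followed by vpand with an assumed 0x00ff mask */
        dst[i] = (uint8_t)((src[i] >> 8) & 0x00ff);
    }
}
```

The vector loop exits only when %rax equals %rdx, so the listing presumably covers the main body for a count that is a multiple of 32; the scalar sketch accepts any n.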