https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750
--- Comment #13 from Hongtao.liu <crazylht at gmail dot com> ---
Created attachment 52031
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52031&action=edit
untested patch

The attached patch can optimize #c0 to

        vmovdqu (%rdi), %ymm1
        vmovdqu16       32(%rdi), %ymm2
        vpcmpuw $0, %ymm0, %ymm1, %k1
        vpcmpuw $0, %ymm0, %ymm2, %k0
        kmovw   %k1, %k2
        kortestw        %k0, %k1
        je      .L10

and #c6 to

.L4:
        vmovdqu (%rdi), %ymm2
        vmovdqu 32(%rdi), %ymm1
        vpcmpuw $0, %ymm0, %ymm2, %k3
        vpcmpuw $0, %ymm0, %ymm1, %k0
        kmovw   %k3, %k1
        kmovw   %k0, %k2
        kortestd        %k2, %k1
        je      .L10

This should be much better than the original version, but it is still a
little suboptimal: the first kmovw should be sunk to the exit edge, and the
latter two kmovw instructions should be eliminated.
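
For reference, a minimal sketch of the kind of source that produces this
vpcmpuw + mask-or + zero-test pattern (this is not the PR's actual #c0
testcase, and the function name any_eq is hypothetical; compile with
something like -O2 -mavx512bw -mavx512vl):

#include <immintrin.h>

/* Return nonzero if any of 32 consecutive 16-bit elements equals c.
   The two compares each produce a __mmask16; the OR of the masks
   tested against zero is what the patch combines into a single
   kortestw instead of kmovw + or + test.  */
int
any_eq (const unsigned short *p, unsigned short c)
{
  __m256i v = _mm256_set1_epi16 ((short) c);
  __mmask16 k0
    = _mm256_cmpeq_epu16_mask (_mm256_loadu_si256 ((const __m256i *) p), v);
  __mmask16 k1
    = _mm256_cmpeq_epu16_mask (_mm256_loadu_si256 ((const __m256i *) (p + 16)),
                               v);
  return (unsigned short) (k0 | k1) != 0;
}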