https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #13 from Hongtao.liu <crazylht at gmail dot com> ---
Created attachment 52031
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52031&action=edit
untested patch.

The attached patch can optimize #c0 to

        vmovdqu (%rdi), %ymm1
        vmovdqu16       32(%rdi), %ymm2
        vpcmpuw $0, %ymm0, %ymm1, %k1
        vpcmpuw $0, %ymm0, %ymm2, %k0
        kmovw   %k1, %k2
        kortestw        %k0, %k1
        je      .L10


and #c6 to

.L4:
        vmovdqu (%rdi), %ymm2
        vmovdqu 32(%rdi), %ymm1
        vpcmpuw $0, %ymm0, %ymm2, %k3
        vpcmpuw $0, %ymm0, %ymm1, %k0
        kmovw   %k3, %k1
        kmovw   %k0, %k2
        kortestd        %k2, %k1
        je      .L10


It should be much better than the original version, but still a little
suboptimal: the first kmovw should be sunk to the exit edge, and the latter
2 kmovw should be eliminated.
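
With those copies eliminated (and, in the #c0 loop, the remaining kmovw
sunk onto the exit edge where %k2 is actually consumed), the #c6 loop
should presumably shrink to something like the following hand-edited
sketch. This is not compiler output; it assumes the compares can target
%k1/%k0 directly:

.L4:
        vmovdqu (%rdi), %ymm2
        vmovdqu 32(%rdi), %ymm1
        vpcmpuw $0, %ymm0, %ymm2, %k1
        vpcmpuw $0, %ymm0, %ymm1, %k0
        kortestd        %k0, %k1
        je      .L10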
