https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #9 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Thiago Macieira from comment #0)
> Testcase:
...
> The assembly for this produces:
> 
>         vmovdqu16       (%rdi), %ymm1
>         vmovdqu16       32(%rdi), %ymm2
>         vpcmpuw $0, %ymm0, %ymm1, %k0
>         vpcmpuw $0, %ymm0, %ymm2, %k1
>         kmovw   %k0, %edx
>         kmovw   %k1, %eax
>         kortestw        %k1, %k0
>         je      .L10
> 
> Those two KMOVW instructions aren't required for the check that follows.
> They're also dispatched on port 0, same as the KORTESTW, meaning the KORTEST
> can't be dispatched until those two have executed, thus introducing a
> 2-cycle delay in this loop.

These are not NOP moves but zero-extensions (HImode to SImode), as the insn
pattern names in the annotated assembly show:

        vmovdqu16       (%rdi), %ymm1   # 93    [c=17 l=6]  movv16hi_internal/2
        vmovdqu16       32(%rdi), %ymm2 # 94    [c=21 l=7]  movv16hi_internal/2
        vpcmpuw $0, %ymm0, %ymm1, %k0   # 21    [c=4 l=7]  avx512vl_ucmpv16hi3
        vpcmpuw $0, %ymm0, %ymm2, %k1   # 27    [c=4 l=7]  avx512vl_ucmpv16hi3
        kmovw   %k0, %edx       # 30    [c=4 l=4]  *zero_extendhisi2/1
        kmovw   %k1, %eax       # 29    [c=4 l=4]  *zero_extendhisi2/1
        kortestw        %k1, %k0        # 31    [c=4 l=4]  kortesthi

They are emitted because, for some reason, the tree optimizers give us:

  _28 = VIEW_CONVERT_EXPR<__v16hi>(_31);
  _29 = __builtin_ia32_ucmpw256_mask (_28, _20, 0, 65535);
  _26 = VIEW_CONVERT_EXPR<__v16hi>(_30);
  _27 = __builtin_ia32_ucmpw256_mask (_26, _20, 0, 65535);
  _2 = (int) _27;
  _3 = (int) _29;
  _15 = __builtin_ia32_kortestzhi (_3, _2);
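
To illustrate why the widening does not change the result, here is a minimal
plain-C sketch of the GIMPLE above. It is only a model: it uses no intrinsics,
the function names (kortestz16_direct, kortestz16_via_int, count_mismatches)
are hypothetical and not part of GCC, and it assumes the mask type is an
unsigned 16-bit integer, so that the (int) conversion is a zero-extension.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the KORTESTW zero-flag result:
   1 when the OR of the two 16-bit masks has no bit set. */
static inline int kortestz16_direct(uint16_t a, uint16_t b)
{
    return (uint16_t)(a | b) == 0;
}

/* Same test, but shaped like the GIMPLE: each mask is first widened
   to int (the zero-extension) and only then handed to the test, which
   looks at the low 16 bits again. */
static inline int kortestz16_via_int(uint16_t m0, uint16_t m1)
{
    int w0 = (int) m0;   /* models _3 = (int) _29 */
    int w1 = (int) m1;   /* models _2 = (int) _27 */
    return kortestz16_direct((uint16_t) w0, (uint16_t) w1);
}

/* Counts inputs on which the two paths disagree; samples every value
   of the first mask against 16 values of the second. */
static inline unsigned count_mismatches(void)
{
    unsigned bad = 0;
    for (unsigned a = 0; a <= 0xFFFFu; a++)
        for (unsigned b = 0; b <= 0xFFFFu; b += 0x1111u)
            if (kortestz16_direct((uint16_t) a, (uint16_t) b)
                != kortestz16_via_int((uint16_t) a, (uint16_t) b))
                bad++;
    assert(bad == 0);    /* widening then truncating is an identity */
    return bad;
}
```

In this model the widened values are never used as 32-bit quantities, which is
the sense in which the two KMOVW instructions are not needed for the KORTESTW.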


> Clang generates:
> 
> .LBB0_2:                                # =>This Inner Loop Header: Depth=1
>         vpcmpeqw        (%rdi), %ymm0, %k0
>         vpcmpeqw        32(%rdi), %ymm0, %k1
>         kortestw        %k0, %k1
>         jne     .LBB0_3
> 
> ICC inserts one KMOVW, but not the other.
> 
> Godbolt build link: https://gcc.godbolt.org/z/cc3heo48M
> 
> LLVM-MCA analysis: https://analysis.godbolt.org/z/dGvY1Wj78
> It shows the Clang loop runs on average 2.0 cycles per loop, whereas the GCC
> code is 3 cycles/loop.
> 
> LLVM-MCA says the ICC loop with one of the two KMOV also runs at 2.0 cycles
> per loop, because it can run in parallel with the second load, given that
> the loads are ports 2 and 3.
