https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750
--- Comment #9 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Thiago Macieira from comment #0)
> Testcase:
...
> The assembly for this produces:
>
>         vmovdqu16       (%rdi), %ymm1
>         vmovdqu16       32(%rdi), %ymm2
>         vpcmpuw $0, %ymm0, %ymm1, %k0
>         vpcmpuw $0, %ymm0, %ymm2, %k1
>         kmovw   %k0, %edx
>         kmovw   %k1, %eax
>         kortestw        %k1, %k0
>         je      .L10
>
> Those two KMOVW instructions aren't required for the check that follows.
> They're also dispatched on port 0, same as the KORTESTW, meaning the KORTEST
> can't be dispatched until those two have executed, thus introducing a
> 2-cycle delay in this loop.

These are not NOP moves but zero-extensions.

        vmovdqu16       (%rdi), %ymm1           # 93    [c=17 l=6]  movv16hi_internal/2
        vmovdqu16       32(%rdi), %ymm2         # 94    [c=21 l=7]  movv16hi_internal/2
        vpcmpuw $0, %ymm0, %ymm1, %k0           # 21    [c=4 l=7]  avx512vl_ucmpv16hi3
        vpcmpuw $0, %ymm0, %ymm2, %k1           # 27    [c=4 l=7]  avx512vl_ucmpv16hi3
        kmovw   %k0, %edx                       # 30    [c=4 l=4]  *zero_extendhisi2/1
        kmovw   %k1, %eax                       # 29    [c=4 l=4]  *zero_extendhisi2/1
        kortestw        %k1, %k0                # 31    [c=4 l=4]  kortesthi

since for some reason tree optimizers give us:

  _28 = VIEW_CONVERT_EXPR<__v16hi>(_31);
  _29 = __builtin_ia32_ucmpw256_mask (_28, _20, 0, 65535);
  _26 = VIEW_CONVERT_EXPR<__v16hi>(_30);
  _27 = __builtin_ia32_ucmpw256_mask (_26, _20, 0, 65535);
  _2 = (int) _27;
  _3 = (int) _29;
  _15 = __builtin_ia32_kortestzhi (_3, _2);

> Clang generates:
>
> .LBB0_2:                                # =>This Inner Loop Header: Depth=1
>         vpcmpeqw        (%rdi), %ymm0, %k0
>         vpcmpeqw        32(%rdi), %ymm0, %k1
>         kortestw        %k0, %k1
>         jne     .LBB0_3
>
> ICC inserts one KMOVW, but not the other.
>
> Godbolt build link: https://gcc.godbolt.org/z/cc3heo48M
>
> LLVM-MCA analysis: https://analysis.godbolt.org/z/dGvY1Wj78
> It shows the Clang loop runs on average 2.0 cycles per loop, whereas the GCC
> code is 3 cycles/loop.
>
> LLVM-MCA says the ICC loop with one of the two KMOV also runs at 2.0 cycles
> per loop, because it can run in parallel with the second load, given that
> the loads are ports 2 and 3.
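
For illustration only, here is a portable C sketch of why the two zero-extensions look removable in principle. It is a model, not GCC's implementation. It assumes that the mask test only looks at the low 16 bits of each operand, as kortestw does. The names kortestz16_model, test_with_widening, test_direct and widening_is_redundant are hypothetical and do not appear in the testcase or in GCC.

```c
#include <stdint.h>

/* Hypothetical stand-in for the 16-bit mask test: returns 1 when the
   bitwise OR of the two masks is zero. Assumption: only the low 16 bits
   of each operand matter, as with kortestw. */
static int kortestz16_model(uint16_t a, uint16_t b)
{
    return (uint16_t)(a | b) == 0;
}

/* Mirrors the shape of the GIMPLE in the comment: each 16-bit mask is
   first widened to int (the zero-extension that shows up as kmovw),
   then narrowed back to 16 bits when it reaches the test. */
static int test_with_widening(uint16_t m0, uint16_t m1)
{
    int w0 = (int)m0;   /* corresponds to _3 = (int) _29; */
    int w1 = (int)m1;   /* corresponds to _2 = (int) _27; */
    return kortestz16_model((uint16_t)w0, (uint16_t)w1);
}

/* The same test with the masks passed straight through. */
static int test_direct(uint16_t m0, uint16_t m1)
{
    return kortestz16_model(m0, m1);
}

/* Compares both variants for every value of the first mask against a
   few sample values of the second. Returns 1 if they always agree. */
int widening_is_redundant(void)
{
    static const uint16_t samples[] = { 0x0000, 0x0001, 0x8000, 0xffff };
    for (uint32_t a = 0; a <= 0xffff; a++) {
        for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
            uint16_t m0 = (uint16_t)a;
            if (test_with_widening(m0, samples[i]) != test_direct(m0, samples[i]))
                return 0;
        }
    }
    return 1;
}
```

Under that assumption the widen-then-narrow round trip never changes the result, which is consistent with Clang emitting kortestw directly on the compare results. Whether GCC can drop the conversions depends on how the builtin's operands are typed, which this sketch does not model.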