https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750
--- Comment #10 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Uroš Bizjak from comment #9)
> (In reply to Thiago Macieira from comment #0)
> > Testcase:
> ...
> > The assembly for this produces:
> >
> >         vmovdqu16       (%rdi), %ymm1
> >         vmovdqu16       32(%rdi), %ymm2
> >         vpcmpuw $0, %ymm0, %ymm1, %k0
> >         vpcmpuw $0, %ymm0, %ymm2, %k1
> >         kmovw   %k0, %edx
> >         kmovw   %k1, %eax
> >         kortestw        %k1, %k0
> >         je      .L10
> >
> > Those two KMOVW instructions aren't required for the check that follows.
> > They're also dispatched on port 0, same as the KORTESTW, meaning the KORTEST
> > can't be dispatched until those two have executed, thus introducing a
> > 2-cycle delay in this loop.
>
> These are not NOP moves but zero-extensions.
>
>         vmovdqu16       (%rdi), %ymm1   # 93    [c=17 l=6]  movv16hi_internal/2
>         vmovdqu16       32(%rdi), %ymm2 # 94    [c=21 l=7]  movv16hi_internal/2
>         vpcmpuw $0, %ymm0, %ymm1, %k0   # 21    [c=4 l=7]  avx512vl_ucmpv16hi3
>         vpcmpuw $0, %ymm0, %ymm2, %k1   # 27    [c=4 l=7]  avx512vl_ucmpv16hi3
>         kmovw   %k0, %edx       # 30    [c=4 l=4]  *zero_extendhisi2/1
>         kmovw   %k1, %eax       # 29    [c=4 l=4]  *zero_extendhisi2/1
>         kortestw        %k1, %k0        # 31    [c=4 l=4]  kortesthi
>
> since for some reason tree optimizers give us:
>
>   _28 = VIEW_CONVERT_EXPR<__v16hi>(_31);
>   _29 = __builtin_ia32_ucmpw256_mask (_28, _20, 0, 65535);
>   _26 = VIEW_CONVERT_EXPR<__v16hi>(_30);
>   _27 = __builtin_ia32_ucmpw256_mask (_26, _20, 0, 65535);
>   _2 = (int) _27;
>   _3 = (int) _29;
>   _15 = __builtin_ia32_kortestzhi (_3, _2);
>

Is there any way to avoid zero_extension for
>   _2 = (int) _27;
>   _3 = (int) _29;

Since __builtin_ia32_kortestzhi is defined to accept 2 short parameters.

Also ABI doesn't ask for clearing the upper bits.
i.e. for testcase

int
__attribute__((noipa))
foo (short a)
{
  return a;
}

int
foo1 (short a)
{
  return foo (a);
}

_Z3foos:
        movswl  %di, %eax
        ret
_Z4foo1s:
        movswl  %di, %edi
        jmp     _Z3foos

movswl in foo1 seems redundant.
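
To make the argument above concrete, here is a rough portable sketch (not GCC code, and not using the real intrinsics). It assumes a consumer that only looks at the low 16 bits of its 32-bit input, which is how KORTESTW and the movswl in foo behave. The function names kortestz16_model and callee_reads_short are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the zero-flag result of KORTESTW: only the low
// 16 bits of the OR of the two operands are examined.
constexpr int kortestz16_model(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::uint16_t>(a | b) == 0 ? 1 : 0;
}

// Hypothetical model of a callee that receives a `short` in a 32-bit
// register and sign-extends the low 16 bits itself (like movswl %di, %eax).
// The narrowing conversion is modular since C++20; GCC behaves the same
// way in earlier standards.
constexpr int callee_reads_short(std::uint32_t reg) {
    return static_cast<std::int16_t>(static_cast<std::uint16_t>(reg));
}

// Upper bits set, low 16 bits clear: the model still reports "all zero".
static_assert(kortestz16_model(0xABCD0000u, 0x12340000u) == 1,
              "upper bits must not affect the 16-bit test");
// A set bit in the low half is still seen.
static_assert(kortestz16_model(0xFFFF0001u, 0u) == 0,
              "low bits must affect the 16-bit test");
// The callee result is the same whether or not the caller extended first.
static_assert(callee_reads_short(0xDEADFFFEu) == callee_reads_short(0xFFFFFFFEu),
              "caller-side extension does not change the callee result");

// Runtime version of the same checks, for use under a debugger.
inline void check_upper_bits_are_ignored() {
    assert(kortestz16_model(0xABCD0000u, 0x12340000u) == 1);
    assert(callee_reads_short(0xDEADFFFEu) == -2);
}
```

Under these assumptions, an extension performed before such a consumer cannot change its result. This says nothing about what the RTL patterns or the ABI actually guarantee.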