https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #10 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Uroš Bizjak from comment #9)
> (In reply to Thiago Macieira from comment #0)
> > Testcase:
> ...
> > The assembly for this produces:
> > 
> >         vmovdqu16       (%rdi), %ymm1
> >         vmovdqu16       32(%rdi), %ymm2
> >         vpcmpuw $0, %ymm0, %ymm1, %k0
> >         vpcmpuw $0, %ymm0, %ymm2, %k1
> >         kmovw   %k0, %edx
> >         kmovw   %k1, %eax
> >         kortestw        %k1, %k0
> >         je      .L10
> > 
> > Those two KMOVW instructions aren't required for the check that follows.
> > They're also dispatched on port 0, same as the KORTESTW, meaning the KORTEST
> > can't be dispatched until those two have executed, thus introducing a
> > 2-cycle delay in this loop.
> 
> These are not NOP moves but zero-extensions.
> 
>         vmovdqu16       (%rdi), %ymm1   # 93    [c=17 l=6]  movv16hi_internal/2
>         vmovdqu16       32(%rdi), %ymm2 # 94    [c=21 l=7]  movv16hi_internal/2
>         vpcmpuw $0, %ymm0, %ymm1, %k0   # 21    [c=4 l=7]  avx512vl_ucmpv16hi3
>         vpcmpuw $0, %ymm0, %ymm2, %k1   # 27    [c=4 l=7]  avx512vl_ucmpv16hi3
>         kmovw   %k0, %edx       # 30    [c=4 l=4]  *zero_extendhisi2/1
>         kmovw   %k1, %eax       # 29    [c=4 l=4]  *zero_extendhisi2/1
>         kortestw        %k1, %k0        # 31    [c=4 l=4]  kortesthi
> 
> since for some reason tree optimizers give us:
> 
>   _28 = VIEW_CONVERT_EXPR<__v16hi>(_31);
>   _29 = __builtin_ia32_ucmpw256_mask (_28, _20, 0, 65535);
>   _26 = VIEW_CONVERT_EXPR<__v16hi>(_30);
>   _27 = __builtin_ia32_ucmpw256_mask (_26, _20, 0, 65535);
>   _2 = (int) _27;
>   _3 = (int) _29;
>   _15 = __builtin_ia32_kortestzhi (_3, _2);
> 
> 

Is there any way to avoid the zero-extension for
>   _2 = (int) _27;
>   _3 = (int) _29;

given that __builtin_ia32_kortestzhi is defined to accept two short parameters?
Also, the ABI doesn't ask for clearing the upper bits.
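
To make the point concrete, here is a rough model in plain C (not the
intrinsic itself; the function names are made up for illustration). It assumes
only that KORTESTW ORs two 16-bit masks and sets ZF when the result is zero,
so widening the masks to int beforehand cannot change the answer:

```c
#include <stdint.h>

/* Hypothetical model of the ZF result of KORTESTW: OR the two 16-bit
   masks and return 1 when the result is all zeros. */
int kortestz16_model(uint16_t a, uint16_t b)
{
    return (uint16_t)(a | b) == 0;
}

/* The same test after widening each mask to int first, mirroring
   `_2 = (int) _27; _3 = (int) _29;` in the quoted GIMPLE. */
int kortestz16_widened(uint16_t a, uint16_t b)
{
    int wa = (int)a;   /* zero-extension: uint16_t -> int */
    int wb = (int)b;
    /* Only the low 16 bits are consumed, so the widening is a no-op
       as far as the test result is concerned. */
    return kortestz16_model((uint16_t)wa, (uint16_t)wb);
}
```

Both functions return the same value for every pair of masks, which is why the
two kmovw zero-extensions look unnecessary in front of kortestw.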

As an illustration of the ABI point, consider this testcase:
int
__attribute__((noipa))
foo (short a)
{
    return a;
}

int
foo1 (short a)
{
    return foo (a);
}


_Z3foos:
        movswl  %di, %eax
        ret
_Z4foo1s:
        movswl  %di, %edi
        jmp     _Z3foos


The movswl in foo1 seems redundant, since foo sign-extends its argument itself.
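
A small sketch of why, modelling the two movswl instructions in plain C. The
helper names are hypothetical, and the uint16_t to int16_t conversion relies
on GCC's modulo behaviour for out-of-range values:

```c
#include <stdint.h>

/* Models the callee's `movswl %di, %eax`: only the low 16 bits of the
   incoming 32-bit register are read, then sign-extended. */
int32_t callee_extend(uint32_t edi)
{
    return (int32_t)(int16_t)(uint16_t)edi;
}

/* Models the caller's `movswl %di, %edi`: sign-extend the low 16 bits
   back into the full 32-bit register before the jump. */
uint32_t caller_extend(uint32_t edi)
{
    return (uint32_t)(int32_t)(int16_t)(uint16_t)edi;
}
```

For any register contents x, callee_extend(caller_extend(x)) equals
callee_extend(x), so whatever the caller leaves in the upper 16 bits has no
effect on the result.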
