https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102811

--- Comment #7 from Uroš Bizjak <ubizjak at gmail dot com> ---
The improvement with the patch from comment #6:

The testcase:

_Float16 test (_Float16 a, _Float16 b)
{
  return a + b;
}

compiles with unpatched gcc -O2 -mf16c to:

        vmovss  %xmm0, %xmm0, %xmm2     # 27    [c=4 l=4]  *movhf_internal/3
        pextrw  $0, %xmm1, -4(%rsp)     # 28    [c=4 l=6]  *movhf_internal/5
        vpxor   %xmm0, %xmm0, %xmm0     # 7     [c=4 l=4]  movv8hf_internal/0
        vpxor   %xmm1, %xmm1, %xmm1     # 11    [c=4 l=4]  movv8hf_internal/0
        pextrw  $0, %xmm2, -2(%rsp)     # 30    [c=4 l=6]  *movhf_internal/5
        vpinsrw $0, -4(%rsp), %xmm1, %xmm1      # 12    [c=4 l=8]  sse4_1_pinsrph/3
        vpinsrw $0, -2(%rsp), %xmm0, %xmm0      # 8     [c=4 l=8]  sse4_1_pinsrph/3
        vcvtph2ps       %xmm1, %xmm1    # 13    [c=4 l=4]  vcvtph2ps
        vcvtph2ps       %xmm0, %xmm0    # 9     [c=4 l=4]  vcvtph2ps
        vaddss  %xmm1, %xmm0, %xmm0     # 15    [c=12 l=4]  *fop_sf_comm/2
        vinsertps       $0xe, %xmm0, %xmm0, %xmm0       # 17    [c=4 l=4]  vec_setv4sf_0/2
        vcvtps2ph       $4, %xmm0, %xmm0        # 18    [c=4 l=4]  *vcvtps2ph
        ret             # 35    [c=0 l=1]  simple_return_internal

with unpatched gcc -O2 -mf16c -mavx2:

        vpbroadcastw    %xmm0, %xmm0    # 8     [c=4 l=5]  *vec_dupv8hf/1
        vpxor   %xmm2, %xmm2, %xmm2     # 7     [c=4 l=4]  movv8hf_internal/0
        vpbroadcastw    %xmm1, %xmm1    # 13    [c=4 l=5]  *vec_dupv8hf/1
        vpblendw        $1, %xmm0, %xmm2, %xmm2 # 9     [c=4 l=6]  sse4_1_pblendph/2
        vpxor   %xmm0, %xmm0, %xmm0     # 12    [c=4 l=4]  movv8hf_internal/0
        vpblendw        $1, %xmm1, %xmm0, %xmm0 # 14    [c=4 l=6]  sse4_1_pblendph/2
        vcvtph2ps       %xmm2, %xmm2    # 10    [c=4 l=4]  vcvtph2ps
        vcvtph2ps       %xmm0, %xmm0    # 15    [c=4 l=4]  vcvtph2ps
        vaddss  %xmm0, %xmm2, %xmm0     # 17    [c=12 l=4]  *fop_sf_comm/2
        vinsertps       $0xe, %xmm0, %xmm0, %xmm0       # 19    [c=4 l=4]  vec_setv4sf_0/2
        vcvtps2ph       $4, %xmm0, %xmm0        # 20    [c=4 l=4]  *vcvtps2ph
        ret             # 36    [c=0 l=1]  simple_return_internal

And with patched gcc -O2 -mf16c:

        vpxor   %xmm2, %xmm2, %xmm2     # 32    [c=4 l=4]  movv8hf_internal/0
        vpblendw        $1, %xmm0, %xmm2, %xmm0 # 9     [c=4 l=6]  *vec_setv8hf_0/8
        vpblendw        $1, %xmm1, %xmm2, %xmm1 # 14    [c=4 l=6]  *vec_setv8hf_0/8
        vcvtph2ps       %xmm1, %xmm1    # 15    [c=4 l=4]  vcvtph2ps
        vcvtph2ps       %xmm0, %xmm0    # 10    [c=4 l=4]  vcvtph2ps
        vaddss  %xmm1, %xmm0, %xmm0     # 17    [c=12 l=4]  *fop_sf_comm/2
        vinsertps       $0xe, %xmm0, %xmm0, %xmm0       # 19    [c=4 l=4]  vec_setv4sf_0/2
        vcvtps2ph       $4, %xmm0, %xmm0        # 20    [c=4 l=4]  *vcvtps2ph
        ret             # 40    [c=0 l=1]  simple_return_internal

The above dumps show an inconsistency for PEXTRW (it should be VPEXTRW) and also
raise the question of why unpatched gcc prefers a memory temporary over a GPR
temporary for PEXTRW/PINSRW.

The patch improves HI/HFmode inserts to element 0 in general.
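
For readers who want to follow the patched sequence at the source level, here
is a rough intrinsics sketch of the same steps: insert element 0 into a zeroed
vector with a word blend, widen to float, add, and narrow back. It is only an
illustration, not code from the patch. The helper name add_half_bits is
hypothetical, the function takes raw binary16 bit patterns instead of
_Float16, it assumes an x86 CPU with F16C and SSE4.1, and it leaves out the
vinsertps step from the dumps.

```c
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper, not from the patch: adds two binary16 values given
   as raw bit patterns.  Assumes an x86 CPU with F16C and SSE4.1. */
__attribute__((target("f16c,sse4.1")))
static uint16_t add_half_bits(uint16_t a, uint16_t b)
{
    const __m128i zero = _mm_setzero_si128();

    /* Broadcast the inputs so that elements 1..7 are not known to be
       zero, as could be the case for an incoming argument register. */
    __m128i va = _mm_set1_epi16((short)a);
    __m128i vb = _mm_set1_epi16((short)b);

    /* Insert element 0 into a zeroed vector: mask bit 0 takes word 0
       from the second operand, all other words come from zero. */
    va = _mm_blend_epi16(zero, va, 1);
    vb = _mm_blend_epi16(zero, vb, 1);

    /* Widen to single precision and add the low elements. */
    __m128 fa = _mm_cvtph_ps(va);
    __m128 fb = _mm_cvtph_ps(vb);
    __m128 sum = _mm_add_ss(fa, fb);

    /* _MM_FROUND_CUR_DIRECTION is 4, the same immediate as in the dumps. */
    __m128i h = _mm_cvtps_ph(sum, _MM_FROUND_CUR_DIRECTION);
    return (uint16_t)_mm_extract_epi16(h, 0);
}
```

With the usual binary16 encodings, add_half_bits(0x3C00, 0x4000), i.e.
1.0 + 2.0, should return 0x4200 (3.0).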
