https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102811
--- Comment #7 from Uroš Bizjak <ubizjak at gmail dot com> ---
The improvement with the patch from comment #6:

The testcase:

_Float16 test (_Float16 a, _Float16 b)
{
  return a + b;
}

compiles with unpatched gcc -O2 -mf16c to:

        vmovss  %xmm0, %xmm0, %xmm2     # 27    [c=4 l=4]  *movhf_internal/3
        pextrw  $0, %xmm1, -4(%rsp)     # 28    [c=4 l=6]  *movhf_internal/5
        vpxor   %xmm0, %xmm0, %xmm0     # 7     [c=4 l=4]  movv8hf_internal/0
        vpxor   %xmm1, %xmm1, %xmm1     # 11    [c=4 l=4]  movv8hf_internal/0
        pextrw  $0, %xmm2, -2(%rsp)     # 30    [c=4 l=6]  *movhf_internal/5
        vpinsrw $0, -4(%rsp), %xmm1, %xmm1      # 12    [c=4 l=8]  sse4_1_pinsrph/3
        vpinsrw $0, -2(%rsp), %xmm0, %xmm0      # 8     [c=4 l=8]  sse4_1_pinsrph/3
        vcvtph2ps       %xmm1, %xmm1    # 13    [c=4 l=4]  vcvtph2ps
        vcvtph2ps       %xmm0, %xmm0    # 9     [c=4 l=4]  vcvtph2ps
        vaddss  %xmm1, %xmm0, %xmm0     # 15    [c=12 l=4]  *fop_sf_comm/2
        vinsertps       $0xe, %xmm0, %xmm0, %xmm0       # 17    [c=4 l=4]  vec_setv4sf_0/2
        vcvtps2ph       $4, %xmm0, %xmm0        # 18    [c=4 l=4]  *vcvtps2ph
        ret             # 35    [c=0 l=1]  simple_return_internal

with unpatched gcc -O2 -mf16c -mavx2:

        vpbroadcastw    %xmm0, %xmm0    # 8     [c=4 l=5]  *vec_dupv8hf/1
        vpxor   %xmm2, %xmm2, %xmm2     # 7     [c=4 l=4]  movv8hf_internal/0
        vpbroadcastw    %xmm1, %xmm1    # 13    [c=4 l=5]  *vec_dupv8hf/1
        vpblendw        $1, %xmm0, %xmm2, %xmm2 # 9     [c=4 l=6]  sse4_1_pblendph/2
        vpxor   %xmm0, %xmm0, %xmm0     # 12    [c=4 l=4]  movv8hf_internal/0
        vpblendw        $1, %xmm1, %xmm0, %xmm0 # 14    [c=4 l=6]  sse4_1_pblendph/2
        vcvtph2ps       %xmm2, %xmm2    # 10    [c=4 l=4]  vcvtph2ps
        vcvtph2ps       %xmm0, %xmm0    # 15    [c=4 l=4]  vcvtph2ps
        vaddss  %xmm0, %xmm2, %xmm0     # 17    [c=12 l=4]  *fop_sf_comm/2
        vinsertps       $0xe, %xmm0, %xmm0, %xmm0       # 19    [c=4 l=4]  vec_setv4sf_0/2
        vcvtps2ph       $4, %xmm0, %xmm0        # 20    [c=4 l=4]  *vcvtps2ph
        ret             # 36    [c=0 l=1]  simple_return_internal

And with patched gcc -O2 -mf16c:

        vpxor   %xmm2, %xmm2, %xmm2     # 32    [c=4 l=4]  movv8hf_internal/0
        vpblendw        $1, %xmm0, %xmm2, %xmm0 # 9     [c=4 l=6]  *vec_setv8hf_0/8
        vpblendw        $1, %xmm1, %xmm2, %xmm1 # 14    [c=4 l=6]  *vec_setv8hf_0/8
        vcvtph2ps       %xmm1, %xmm1    # 15    [c=4 l=4]  vcvtph2ps
        vcvtph2ps       %xmm0, %xmm0    # 10    [c=4 l=4]  vcvtph2ps
        vaddss  %xmm1, %xmm0, %xmm0     # 17    [c=12 l=4]  *fop_sf_comm/2
        vinsertps       $0xe, %xmm0, %xmm0, %xmm0       # 19    [c=4 l=4]  vec_setv4sf_0/2
        vcvtps2ph       $4, %xmm0, %xmm0        # 20    [c=4 l=4]  *vcvtps2ph
        ret             # 40    [c=0 l=1]  simple_return_internal

The above dumps show an inconsistency for PEXTRW (it should be VPEXTRW) and
also raise the question of why unpatched gcc prefers a memory temporary over
a GPR temporary for PEXTRW/PINSRW.

The patch improves HI/HFmode inserts to element 0 in general.
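
For readers unfamiliar with the operation being discussed, the "insert to element 0" that the VPXOR + VPINSRW/VPBLENDW sequences implement can be modelled at the source level roughly as below. This is only an illustrative sketch, not part of the patch: it assumes the GCC/Clang vector extensions, uses uint16_t as a stand-in for the 16-bit HImode/HFmode payload, and the names v8hi, set_elt0 and set_elt0_selfcheck are hypothetical. Which instructions a given compiler emits for it is not claimed here.

```c
#include <assert.h>
#include <stdint.h>

/* Eight 16-bit lanes in a 128-bit vector (GCC/Clang vector extension).
   The typedef name is made up for this sketch.  */
typedef uint16_t v8hi __attribute__ ((vector_size (16)));

/* Hypothetical helper: put X into element 0 of an otherwise
   all-zero vector, i.e. zero the register, then insert lane 0.  */
v8hi
set_elt0 (uint16_t x)
{
  v8hi v = { 0 };   /* all eight lanes zero */
  v[0] = x;         /* insert into element 0 */
  return v;
}

/* Small self-check of the helper above.  */
void
set_elt0_selfcheck (void)
{
  v8hi v = set_elt0 (0x3c00);   /* 0x3c00 is 1.0 in IEEE binary16 */
  assert (v[0] == 0x3c00);
  for (int i = 1; i < 8; i++)
    assert (v[i] == 0);
}
```

In the testcase, each _Float16 argument goes through this kind of element-0 insert before VCVTPH2PS widens it to float.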