https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121587
Bug ID: 121587
Summary: _Float16 vector operations should use addps/subps/mulps/divps if F16C is present
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: mkretz at gcc dot gnu.org
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

Test case (https://compiler-explorer.com/z/hfs4q7n66):

    using V [[gnu::vector_size(16)]] = _Float16;

    V add(V a, V b) { return a + b; }
    V sub(V a, V b) { return a - b; }
    V mul(V a, V b) { return a * b; }
    V div(V a, V b) { return a / b; }

With F16C (x86-64-v3) and without AVX512-FP16, these operations should do:

"add(_Float16 __vector(8), _Float16 __vector(8))":
        vcvtph2ps       ymm2, xmm0
        vcvtph2ps       ymm0, xmm1
        vaddps          ymm0, ymm0, ymm2
        vcvtps2ph       xmm0, ymm0, 4
        vzeroupper
        ret
"sub(_Float16 __vector(8), _Float16 __vector(8))":
        vcvtph2ps       ymm1, xmm1
        vcvtph2ps       ymm0, xmm0
        vsubps          ymm0, ymm0, ymm1
        vcvtps2ph       xmm0, ymm0, 4
        vzeroupper
        ret
"mul(_Float16 __vector(8), _Float16 __vector(8))":
        vcvtph2ps       ymm2, xmm0
        vcvtph2ps       ymm0, xmm1
        vmulps          ymm0, ymm0, ymm2
        vcvtps2ph       xmm0, ymm0, 4
        vzeroupper
        ret
"div(_Float16 __vector(8), _Float16 __vector(8))":
        vcvtph2ps       ymm1, xmm1
        vcvtph2ps       ymm0, xmm0
        vdivps          ymm0, ymm0, ymm1
        vcvtps2ph       xmm0, ymm0, 4
        vzeroupper
        ret

Currently, vector conversions without AVX512-FP16 are already bad. The
VCVTPH2PSX is documented to be different only wrt. broadcasts ("The VCVTPH2PSX
instruction has the embedded broadcasting option available."). So whatever the
backend does for VCVTPH2PSX it can do for VCVTPH2PS if F16C is available.

Otherwise, GCC already uses the v(add|sub|mul|div)ss instructions on each
scalar. It should simply use the vector instruction instead.

(BTW, optimizing multiple _Float16 operations into a sequence of float
instructions to avoid intermediate conversions appears to be valid C++ as long
as FLT_EVAL_METHOD is set to -1 (indeterminable).)
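
For comparison, the requested sequence for add() can be written by hand with the F16C intrinsics. This is only a sketch of the lowering asked for above, not what GCC emits today and not a proposed patch. The names add_f16x8 and has_f16c are hypothetical helpers; the half-precision values are passed as raw 16-bit patterns so the example does not depend on _Float16 support in the compiler. It assumes an x86 target and a recent GCC or Clang (for the target attribute and for __builtin_cpu_supports("f16c")).

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical helper: true if the running CPU can execute the code below.
inline bool has_f16c()
{
    return __builtin_cpu_supports("avx") && __builtin_cpu_supports("f16c");
}

// Hypothetical helper: adds eight IEEE binary16 values given as bit patterns.
// The target attribute enables AVX and F16C for this function only, so the
// file does not have to be compiled with -mf16c or -march=x86-64-v3.
__attribute__((target("avx,f16c")))
void add_f16x8(const std::uint16_t* a, const std::uint16_t* b, std::uint16_t* out)
{
    // Load 8 x 16 bit for each operand.
    const __m128i ha = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i hb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));

    // vcvtph2ps ymm, xmm: widen 8 halves to 8 floats.
    const __m256 fa = _mm256_cvtph_ps(ha);
    const __m256 fb = _mm256_cvtph_ps(hb);

    // vaddps ymm, ymm, ymm: one vector add instead of eight scalar adds.
    const __m256 sum = _mm256_add_ps(fa, fb);

    // vcvtps2ph xmm, ymm, 4: narrow back to half. The immediate 4 is
    // _MM_FROUND_CUR_DIRECTION, i.e. round according to MXCSR.
    const __m128i hr = _mm256_cvtps_ph(sum, _MM_FROUND_CUR_DIRECTION);

    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), hr);
}
```

Callers should check has_f16c() before calling add_f16x8. Replacing _mm256_add_ps with _mm256_sub_ps, _mm256_mul_ps or _mm256_div_ps gives the other three functions from the test case. As far as I can tell, doing a single operation in float and rounding once to half yields the same result as a correctly rounded half operation, since float carries 24 significand bits against 11 for binary16.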