https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121587

            Bug ID: 121587
           Summary: _Float16 vector operations should use
                    addps/subps/mulps/divps if F16C is present
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mkretz at gcc dot gnu.org
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

Test case (https://compiler-explorer.com/z/hfs4q7n66):

using V [[gnu::vector_size(16)]] = _Float16;

V add(V a, V b) { return a + b; }

V sub(V a, V b) { return a - b; }

V mul(V a, V b) { return a * b; }

V div(V a, V b) { return a / b; }

With F16C (x86-64-v3) but without AVX512-FP16, these operations should compile
to:

"add(_Float16 __vector(8), _Float16 __vector(8))":
        vcvtph2ps       ymm2, xmm0
        vcvtph2ps       ymm0, xmm1
        vaddps  ymm0, ymm0, ymm2
        vcvtps2ph       xmm0, ymm0, 4
        vzeroupper
        ret
"sub(_Float16 __vector(8), _Float16 __vector(8))":
        vcvtph2ps       ymm1, xmm1
        vcvtph2ps       ymm0, xmm0
        vsubps  ymm0, ymm0, ymm1
        vcvtps2ph       xmm0, ymm0, 4
        vzeroupper
        ret
"mul(_Float16 __vector(8), _Float16 __vector(8))":
        vcvtph2ps       ymm2, xmm0
        vcvtph2ps       ymm0, xmm1
        vmulps  ymm0, ymm0, ymm2
        vcvtps2ph       xmm0, ymm0, 4
        vzeroupper
        ret
"div(_Float16 __vector(8), _Float16 __vector(8))":
        vcvtph2ps       ymm1, xmm1
        vcvtph2ps       ymm0, xmm0
        vdivps  ymm0, ymm0, ymm1
        vcvtps2ph       xmm0, ymm0, 4
        vzeroupper
        ret

Currently, even plain vector conversions are poorly optimized without
AVX512-FP16. VCVTPH2PSX is documented to differ from VCVTPH2PS only wrt.
broadcasts ("The VCVTPH2PSX instruction has the embedded broadcasting option
available."). So whatever the backend does for VCVTPH2PSX it can also do for
VCVTPH2PS if F16C is available.

Apart from the conversions, GCC already uses the v(add|sub|mul|div)ss
instructions on each scalar element. It should simply use the corresponding
packed instruction (v(add|sub|mul|div)ps) instead.

(BTW, optimizing multiple _Float16 operations into a sequence of float
instructions to avoid intermediate conversions appears to be valid C++ as long
as FLT_EVAL_METHOD is set to -1 (indeterminable).)
