https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119626

--- Comment #6 from mcccs at gmx dot com ---
Lastly, I would like to mention why this issue matters so much when using
__bf16, and why __bf16 is otherwise very inefficient: bfcvt is not only used
for casts. Consider the following code:

__bf16 a[4];
void multiply() {
    for (int i = 0; i < 4; i++)
        a[i] *= 16;
}

This also involves the bfcvt instruction, even though the source contains no
explicit cast.
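To illustrate why a plain multiply ends up needing a float-to-bf16 conversion,
here is a portable software model, not the code either compiler emits. It
assumes the multiply is carried out in float and the result narrowed back to
bf16, and that the narrowing rounds to nearest even (the default rounding
mode). The names bf16_bits, bf16_to_float, float_to_bf16 and multiply_model
are hypothetical, chosen only for this sketch.

```c
#include <stdint.h>
#include <string.h>

/* Software model of bfloat16: the upper 16 bits of an IEEE-754 binary32.
   (Hypothetical helper type, not the compiler's __bf16.) */
typedef uint16_t bf16_bits;

/* Widening bf16 -> float is exact: the 16 bits become the high half. */
static float bf16_to_float(bf16_bits h) {
    uint32_t u = (uint32_t)h << 16;
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* Narrowing float -> bf16, assuming round-to-nearest-even.
   This is the step a hardware conversion instruction would perform. */
static bf16_bits float_to_bf16(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    if ((u & 0x7fffffffu) > 0x7f800000u)       /* NaN: keep sign, set quiet bit */
        return (bf16_bits)((u >> 16) | 0x0040u);
    u += 0x7fffu + ((u >> 16) & 1u);           /* round to nearest, ties to even */
    return (bf16_bits)(u >> 16);
}

/* Model of "a[i] *= 16": widen, multiply in float, narrow back. */
static void multiply_model(bf16_bits a[4]) {
    for (int i = 0; i < 4; i++)
        a[i] = float_to_bf16(bf16_to_float(a[i]) * 16.0f);
}
```

Under these assumptions every element needs one narrowing conversion per
iteration, so the quality of the float-to-bf16 path affects ordinary
arithmetic on __bf16 and not just explicit casts.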

The function compiles to:

Clang O3 -bf16: 13 instructions

Clang O3 +bf16: 8 instructions

GCC O3 +bf16: 43 instructions

Comparing with Clang, there seem to be two parts to solving the problem. The
first is to ensure that

__bf16 convert(float x) {
    return (__bf16) x;
}

uses bfcvt.

The second is to ensure that

void convert2(float * __restrict a, __bf16 * __restrict x) {
    for (int i = 0; i < 4; i++)
        x[i] = (__bf16)a[i];
}

can be vectorized even with -march=...-bf16.
