https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119626
--- Comment #6 from mcccs at gmx dot com ---
Lastly, I would like to mention why this is such an important issue for the use of __bf16, and why __bf16 is otherwise very inefficient: bfcvt is not only used for casts. Consider the following code:

    __bf16 a[4];

    void multiply() {
      for (int i = 0; i < 4; i++)
        a[i] *= 16;
    }

It does involve the bfcvt instruction. The function compiles to:

    Clang O3 -bf16: 13 instructions
    Clang O3 +bf16:  8 instructions
    GCC   O3 +bf16: 43 instructions

Comparing with Clang, there seem to be two parts to solving the problem. The first is to ensure that

    __bf16 convert(float x) {
      return (__bf16) x;
    }

uses bfcvt. The second is to ensure that

    void convert2(float * __restrict a, __bf16 * __restrict x) {
      for (int i = 0; i < 4; i++)
        x[i] = (__bf16)a[i];
    }

can be vectorized even with march=...-bf16
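
For readers wondering why a multiplication needs a conversion at all: the multiply is carried out in float, so each element is widened to float, multiplied, and then narrowed back to __bf16. The narrowing is the step bfcvt performs in a single instruction. Below is a rough portable sketch of that round trip on raw bit patterns, assuming round-to-nearest-even and a simple NaN-quieting rule. The helper names (float_to_bf16_bits, bf16_bits_to_float, bf16_mul16) are made up for illustration, and this is not the sequence GCC or Clang actually emits.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: narrow a float to a bf16 bit pattern.
   Assumes round-to-nearest-even; NaNs are quieted by setting the
   top mantissa bit (an assumption, not a statement about bfcvt). */
static uint16_t float_to_bf16_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    if ((bits & 0x7fffffffu) > 0x7f800000u)      /* NaN input */
        return (uint16_t)((bits >> 16) | 0x0040u);

    uint32_t lsb = (bits >> 16) & 1u;            /* lowest bit that survives */
    bits += 0x7fffu + lsb;                       /* ties go to the even result */
    return (uint16_t)(bits >> 16);
}

/* Hypothetical helper: widening is exact, bf16 is the top half of a float. */
static float bf16_bits_to_float(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* One element of the loop body "a[i] *= 16": widen, multiply, narrow. */
static uint16_t bf16_mul16(uint16_t h)
{
    return float_to_bf16_bits(bf16_bits_to_float(h) * 16.0f);
}
```

The widening direction is only a shift, so the cost is in the narrowing step. Doing the rounding and NaN handling with ordinary integer instructions for every element is one plausible reason the instruction count grows when bfcvt is not used.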