https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119626
--- Comment #5 from mcccs at gmx dot com ---
Sorry for another ping. I did some more research. To make this issue easier to confirm, the expected behavior can be checked against Clang.

Clang behavior with -march=armv9-a+bf16 -O3:

void convert1(int * __restrict a, __bf16 * __restrict x)
{
    for (int i = 0; i < 4; i++)
        x[i] = (__bf16)a[i];
}

void convert2(float * __restrict a, __bf16 * __restrict x)
{
    for (int i = 0; i < 4; i++)
        x[i] = (__bf16)a[i];
}

produces:

convert1(int*, __bf16*):
        ldr     q0, [x0]
        scvtf   v0.4s, v0.4s
        bfcvtn  v0.4h, v0.4s
        str     d0, [x1]
        ret

convert2(float*, __bf16*):
        ldr     q0, [x0]
        bfcvtn  v0.4h, v0.4s
        str     d0, [x1]
        ret

With GCC, the generated assembly not only fails to use bfcvt, it is also unvectorized. Clang can vectorize these loops even without `+bf16`, but GCC can't.

Maybe this should be split into two separate bugs: one for vectorizing the loops without `+bf16`, and one for using the bfcvt instruction when `+bf16` is enabled.