https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119626

--- Comment #5 from mcccs at gmx dot com ---
Sorry for another ping. I did some more research. To make it easier for you
to confirm this issue, here is the expected behavior as produced by Clang:

Clang behavior with -march=armv9-a+bf16 -O3:

void convert1(int * __restrict a, __bf16 * __restrict x) {
    for (int i = 0; i < 4; i++)
        x[i] = (__bf16)a[i];
}

void convert2(float * __restrict a, __bf16 * __restrict x) {
    for (int i = 0; i < 4; i++)
        x[i] = (__bf16)a[i];
}

produces:

convert1(int*, __bf16*):
        ldr     q0, [x0]
        scvtf   v0.4s, v0.4s
        bfcvtn  v0.4h, v0.4s
        str     d0, [x1]
        ret

convert2(float*, __bf16*):
        ldr     q0, [x0]
        bfcvtn  v0.4h, v0.4s
        str     d0, [x1]
        ret


whereas with GCC the generated assembly not only does not use bfcvt, it is also
unvectorized. Clang can vectorize these loops even without `+bf16`, but GCC
can't. Maybe this should be split into two separate bugs (one for vectorizing
the loops without `+bf16` and one for using the bfcvt instruction when `+bf16`
is enabled).
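
For reference, here is a rough scalar sketch of the work the float case needs
when no bf16 conversion instruction is available. It assumes
round-to-nearest-even and NaN quieting, and it stores raw bf16 bit patterns in
uint16_t instead of using __bf16. The names float_to_bf16_bits and
convert2_bits are made up for illustration; this is not what GCC or Clang
actually emit, and it is not claimed to match bfcvt in every rounding mode.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of GCC, Clang, or libgcc): convert one
   float to a bfloat16 bit pattern, assuming round-to-nearest-even. */
static uint16_t float_to_bf16_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* reinterpret the float as its bits */

    if ((bits & 0x7fffffffu) > 0x7f800000u) {
        /* NaN: keep the sign and upper payload, force the quiet bit so
           truncation cannot turn the value into an infinity. */
        return (uint16_t)((bits >> 16) | 0x0040u);
    }

    /* Add 0x7fff plus the lowest kept bit, so that exact ties round to
       the value whose low bit is zero (round-to-nearest-even). */
    uint32_t rounding_bias = 0x7fffu + ((bits >> 16) & 1u);
    return (uint16_t)((bits + rounding_bias) >> 16);
}

/* Same loop shape as convert2 above, but on raw bf16 bit patterns. */
void convert2_bits(const float *restrict a, uint16_t *restrict x) {
    for (int i = 0; i < 4; i++)
        x[i] = float_to_bf16_bits(a[i]);
}
```

Everything in the helper is integer arithmetic on 32-bit lanes, so in principle
a loop like convert2_bits does not depend on `+bf16` to be vectorizable.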
