https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118141

--- Comment #1 from Richard Yao <richard.yao at alumni dot stonybrook.edu> ---
As an additional comment, while Clang does a good job on this function, it
could do better. Specifically, this version uses one fewer instruction:

convert_fp32_to_bfloat16:
        vmovups (%rdi), %ymm0
        vpsrld  $16, %ymm0, %ymm0
        vphaddw %ymm0, %ymm0, %ymm0
        vmovdqu %xmm0, (%rsi)
        vzeroupper
        ret

Using vphaddw to implement __builtin_convertvector() works here because we
know the upper 16 bits of every 32-bit element are 0 after the shift, so each
pairwise 16-bit add is just x + 0. That said, I am not sure whether this would
be a worthwhile optimization to implement once the original optimization bug
is fixed.
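
The x + 0 reasoning can be checked with a scalar model. The sketch below is
not compiler output; the names model_vpsrld16, model_vphaddw_same and
model_convert are hypothetical. It assumes the in-lane semantics that the
Intel SDM documents for the 256-bit VEX form of vphaddw, where each 128-bit
lane is added independently.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of "vpsrld $16" on eight 32-bit elements. */
static void model_vpsrld16(uint32_t v[8]) {
    for (int i = 0; i < 8; i++)
        v[i] >>= 16;
}

/* Scalar model of 256-bit vphaddw with both sources set to the same
 * register. Assumption: adjacent 16-bit pairs are added within each
 * 128-bit lane, and each lane's sums are written twice (once for the
 * first source, once for the second). */
static void model_vphaddw_same(const uint16_t src[16], uint16_t dst[16]) {
    for (int lane = 0; lane < 2; lane++) {
        const uint16_t *s = src + 8 * lane;
        uint16_t *d = dst + 8 * lane;
        for (int i = 0; i < 4; i++) {
            uint16_t sum = (uint16_t)(s[2 * i] + s[2 * i + 1]);
            d[i] = sum;      /* words taken from the first source */
            d[i + 4] = sum;  /* words taken from the second source */
        }
    }
}

/* Hypothetical driver: runs shift then horizontal add and returns all
 * 16 result words so that both 128-bit lanes can be inspected. */
static void model_convert(const float in[8], uint16_t out[16]) {
    uint32_t bits[8];
    uint16_t words[16];

    memcpy(bits, in, sizeof bits);  /* reinterpret the floats as raw bits */
    model_vpsrld16(bits);

    for (int i = 0; i < 8; i++) {
        assert((bits[i] >> 16) == 0);              /* upper half is zero */
        words[2 * i] = (uint16_t)(bits[i] & 0xFFFFu);
        words[2 * i + 1] = (uint16_t)(bits[i] >> 16);
    }
    model_vphaddw_same(words, out);
}
```

Under that assumption every pairwise sum does equal the truncated value, but
words 0 to 3 and 4 to 7 of the low lane both hold the results for elements
0 to 3, while the results for elements 4 to 7 sit in the upper lane. If the
model is accurate, storing only %xmm0 would need a cross-lane step first (for
example a vpermq), which would likely cancel the one-instruction saving.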
