https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118141
--- Comment #1 from Richard Yao <richard.yao at alumni dot stonybrook.edu> ---
As an additional comment, while Clang does a good job on this function, it
could do better. Specifically, this uses one fewer instruction:

convert_fp32_to_bfloat16:
        vmovups (%rdi), %ymm0
        vpsrld  $16, %ymm0, %ymm0
        vphaddw %ymm0, %ymm0, %ymm0
        vmovdqu %xmm0, (%rsi)
        vzeroupper
        ret

Using vphaddw to do __builtin_convertvector() works here because we know the
top 16-bit value of every 32-bit lane is 0 due to the shift operation, so each
pairwise sum is just the low 16-bit value plus 0.

That said, I am not sure if this would be a worthwhile optimization to
implement once the original optimization bug is fixed.
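The vphaddw argument can be checked with a small scalar model in portable C. This is only a sketch, not the code from the bug report: the helper names (bf16_truncate, lane_shift_then_hadd, lane_matches_truncation) are hypothetical, and the conversion is assumed to be the truncating one implied by the 16-bit shift. It models a single 128-bit lane (four floats), since the 256-bit form of vphaddw adds adjacent pairs within each 128-bit lane separately. It does not model how the two lanes reach the final 128-bit store.

```c
#include <stdint.h>
#include <string.h>

/* Assumed reference conversion: keep the top 16 bits of the fp32 encoding. */
static uint16_t bf16_truncate(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);
}

/* Scalar model of one 128-bit lane: vpsrld $16, then pairwise 16-bit adds. */
static void lane_shift_then_hadd(const float in[4], uint16_t out[4])
{
    uint16_t words[8];

    for (int i = 0; i < 4; i++) {
        uint32_t bits;
        memcpy(&bits, &in[i], sizeof bits);
        bits >>= 16;                                  /* models vpsrld $16 */
        words[2 * i]     = (uint16_t)(bits & 0xFFFF); /* low word: bf16 bits */
        words[2 * i + 1] = (uint16_t)(bits >> 16);    /* high word: always 0 */
    }

    /* Models vphaddw: each result is the sum of two adjacent 16-bit words. */
    for (int i = 0; i < 4; i++)
        out[i] = (uint16_t)(words[2 * i] + words[2 * i + 1]);
}

/* Returns 1 if the modelled lane equals plain truncation for all inputs. */
static int lane_matches_truncation(const float in[4])
{
    uint16_t out[4];

    lane_shift_then_hadd(in, out);
    for (int i = 0; i < 4; i++)
        if (out[i] != bf16_truncate(in[i]))
            return 0;
    return 1;
}
```

Because the high word of every shifted 32-bit element is 0, the pairwise sum can never carry or wrap, which is why the add behaves as a narrowing conversion in this model.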