https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118141
--- Comment #4 from Richard Yao <richard.yao at alumni dot stonybrook.edu> ---
(In reply to Richard Yao from comment #1)
> As an additional comment, while Clang does a good job on this function, it
> could do better. In specific, this uses 1 less instruction:
>
> convert_fp32_to_bfloat16:
>         vmovups (%rdi), %ymm0
>         vpsrld  $16, %ymm0, %ymm0
>         vphaddw %ymm0, %ymm0, %ymm0
>         vmovdqu %xmm0, (%rsi)
>         vzeroupper
>         ret
>
> Using vphaddw to do __builtin_convertvector() works here because we know the
> top 16-bit value of every 32-bit lane is 0 due to the shift operation. That
> said, I am not sure if this would be a worthwhile optimization to implement
> once the original optimization bug is fixed.

Disregard this. I made a mistake when reviewing the output from this code.
Clang's method is the best that I can see for doing this.
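
The comment does not say what the mistake was. One way to check the quoted sequence is a scalar model in C, sketched here under stated assumptions: the helper name `model_quoted_sequence` is hypothetical, `float` is assumed to be 32-bit IEEE-754, and the model follows the documented AVX2 behaviour of the 256-bit `vphaddw`, which adds adjacent 16-bit words within each 128-bit lane separately. It is an illustration, not code from the bug report.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of the quoted sequence:
 *   vmovups (%rdi), %ymm0
 *   vpsrld  $16, %ymm0, %ymm0
 *   vphaddw %ymm0, %ymm0, %ymm0
 *   vmovdqu %xmm0, (%rsi)
 * Assumes float is a 32-bit IEEE-754 type. */
static void model_quoted_sequence(const float src[8], uint16_t dst[8])
{
    uint32_t dwords[8];   /* ymm0 viewed as eight 32-bit elements  */
    uint16_t words[16];   /* ymm0 viewed as sixteen 16-bit words   */
    uint16_t result[16];  /* ymm0 after vphaddw                    */

    assert(sizeof(float) == sizeof(uint32_t));

    /* vmovups: load eight floats as raw bits. */
    memcpy(dwords, src, sizeof dwords);

    /* vpsrld $16: logical right shift of every 32-bit element,
     * leaving the upper 16 bits of each element zero. */
    for (int i = 0; i < 8; i++)
        dwords[i] >>= 16;

    /* Split each 32-bit element into its low and high 16-bit word. */
    for (int i = 0; i < 8; i++) {
        words[2 * i]     = (uint16_t)(dwords[i] & 0xFFFFu);
        words[2 * i + 1] = (uint16_t)(dwords[i] >> 16);
    }

    /* vphaddw with both sources equal to ymm0. The operation is
     * performed independently in each 128-bit lane: the first four
     * words of a lane are the pair sums of that lane of source 1,
     * the next four are the pair sums of that lane of source 2. */
    for (int lane = 0; lane < 2; lane++) {
        int base = 8 * lane;
        for (int j = 0; j < 4; j++) {
            uint16_t sum = (uint16_t)(words[base + 2 * j] +
                                      words[base + 2 * j + 1]);
            result[base + j]     = sum;  /* from source 1 */
            result[base + 4 + j] = sum;  /* from source 2 (same register) */
        }
    }

    /* vmovdqu %xmm0: store only the low 128 bits (eight words). */
    memcpy(dst, result, 8 * sizeof(uint16_t));
}
```

In this model, the eight words written by `vmovdqu %xmm0, (%rsi)` are the truncated upper halves of input elements 0 to 3, followed by the same four values again; elements 4 to 7 stay in the upper 128-bit lane and are never stored. Whether this is the mistake the comment refers to is not stated in the report.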