https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118141

--- Comment #4 from Richard Yao <richard.yao at alumni dot stonybrook.edu> ---
(In reply to Richard Yao from comment #1)
> As an additional comment, while Clang does a good job on this function, it
> could do better. In specific, this uses 1 less instruction:
> 
> convert_fp32_to_bfloat16:
>         vmovups (%rdi), %ymm0
>         vpsrld  $16, %ymm0, %ymm0
>         vphaddw %ymm0, %ymm0, %ymm0
>         vmovdqu %xmm0, (%rsi)
>         vzeroupper
>         ret
> 
> Using vphaddw to do __builtin_convertvector() works here because we know the
> top 16-bit value of every 32-bit lane is 0 due to the shift operation. That
> said, I am not sure if this would be a worthwhile optimization to implement
> once the original optimization bug is fixed.

Disregard this. I made a mistake when reviewing the output from this code.
Clang's method is the best that I can see for doing this.
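
For readers following along, here is a minimal scalar sketch in C of what the quoted sequence would compute, next to a plain truncating reference. The comment above does not say what the mistake was, so this is only one plausible reading and an assumption: on 256-bit registers vphaddw adds pairs within each 128-bit lane separately, so with the same register as both sources the low 128 bits hold the results for elements 0-3 twice rather than elements 0-7. The function names convert_ref and convert_vphaddw_model are hypothetical and do not come from the bug report.

```c
#include <stdint.h>
#include <string.h>

/* Reference: truncate 8 fp32 values to bfloat16 by keeping the upper
 * 16 bits of each 32-bit pattern (the effect of the vpsrld $16 step). */
static void convert_ref(const float *in, uint16_t *out)
{
    for (int i = 0; i < 8; i++) {
        uint32_t bits;
        memcpy(&bits, &in[i], sizeof bits);
        out[i] = (uint16_t)(bits >> 16);
    }
}

/* Scalar model (an assumption, not the author's analysis) of:
 *   vpsrld $16, %ymm0, %ymm0
 *   vphaddw %ymm0, %ymm0, %ymm0
 *   vmovdqu %xmm0, (%rsi)
 * The register is viewed as 16 words, word 0 being the lowest. */
static void convert_vphaddw_model(const float *in, uint16_t *out)
{
    uint16_t w[16];
    uint16_t r[16];

    /* After the shift: low word of each dword = bf16 bits, high word = 0. */
    for (int i = 0; i < 8; i++) {
        uint32_t bits;
        memcpy(&bits, &in[i], sizeof bits);
        w[2 * i] = (uint16_t)(bits >> 16);
        w[2 * i + 1] = 0;
    }

    /* vphaddw works on each 128-bit lane independently: the low 64 bits
     * of a lane come from the first source, the high 64 bits from the
     * second source. Both sources are the same register here. */
    for (int lane = 0; lane < 2; lane++) {
        const uint16_t *s = w + 8 * lane;
        uint16_t *d = r + 8 * lane;
        for (int k = 0; k < 4; k++) {
            uint16_t sum = (uint16_t)(s[2 * k] + s[2 * k + 1]);
            d[k] = sum;     /* from the first source operand */
            d[k + 4] = sum; /* from the second source operand */
        }
    }

    /* vmovdqu %xmm0 stores only the low 128 bits (8 words). */
    memcpy(out, r, 8 * sizeof(uint16_t));
}
```

Under this model the first four outputs match the reference, while outputs 4-7 repeat outputs 0-3. That would explain why the shorter sequence does not work, but it is not confirmed anywhere in the bug report.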
