[Bug target/121587] _Float16 vector operations should use addps/subps/mulps/divps if F16C is present

mkretz at gcc dot gnu.org via Gcc-bugs Tue, 19 Aug 2025 10:25:03 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121587


--- Comment #2 from Matthias Kretz (Vir) <mkretz at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #1)
> I also wonder whether one can efficiently emulate FP16 rounding/truncation
> on floats or whether the actual conversion roundtrip is more efficient.

The Intel Intrinsics Guide documents a latency of 7/8 and a throughput of 1 for
vcvtph2ps and vcvtps2ph (the YMM variants). So the roundtrip has a latency of
14/15.

vroundps also has latency 8, throughput 1. For range reduction a mul - mul
sequence (and I'm not sure whether that's sufficient) would increase that
latency by another 10? If we ignore range reduction then inserting vroundps
instead of FP16 roundtrips would be cheaper, yes. Otherwise, I don't see how.
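
To make "ignore range reduction" concrete, here is a minimal scalar sketch of
rounding a float to FP16 precision without touching its range. It uses integer
bit manipulation rather than vroundps, so it is a different technique from the
one discussed above and says nothing about relative cost. The function name
round_to_fp16_precision is hypothetical, and the sketch assumes IEEE 754
binary32 floats and round-to-nearest-even.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: round x to 11 significant bits (FP16 precision)
 * using round-to-nearest-even on the binary32 bit pattern.
 *
 * Range is deliberately NOT handled: values that would overflow to
 * infinity or become subnormal/zero in FP16 keep their float exponent.
 * NaNs whose payload lives only in the low 13 bits turn into infinity. */
static float round_to_fp16_precision(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);

    /* binary32 has 23 explicit mantissa bits, binary16 has 10. */
    const uint32_t drop = 23u - 10u;

    /* Add just under half an FP16 ulp, plus the kept LSB so that exact
     * ties round to even. A carry out of the mantissa bumps the exponent,
     * which is the correct result. */
    uint32_t lsb = (bits >> drop) & 1u;
    bits += ((1u << (drop - 1u)) - 1u) + lsb;

    /* Clear the 13 bits that FP16 cannot represent. */
    bits &= ~((1u << drop) - 1u);

    memcpy(&x, &bits, sizeof x);
    return x;
}
```

For inputs inside the FP16 normal range this should agree with a
vcvtps2ph/vcvtph2ps roundtrip in round-to-nearest mode. Outside that range the
results differ, which is exactly the part range reduction would have to cover.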

The two types float16_eval_as_float32_if_faster and
float16_with_guaranteed_precision_and_range are both useful but have different
applications. They should be two types rather than a compiler flag. But that's
true for all floating-point types and fast-math flags ... (I'm repeating myself
on this point.)

[Bug target/121587] _Float16 vector operations should use addps/subps/mulps/divps if F16C is present
