https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121587

--- Comment #2 from Matthias Kretz (Vir) <mkretz at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #1)
> I also wonder whether one can efficiently emulate FP16 rounding/truncation
> on floats or whether the actual conversion roundtrip is more efficient.

The Intel Intrinsics Guide documents a latency of 7/8 and throughput of 1 for
ph2ps and ps2ph (the YMM variants). So the roundtrip has a latency of 14/15.
vroundps also has latency 8, throughput 1. For range reduction a mul - mul
sequence (and I'm not sure whether that's sufficient) would increase that
latency by another 10?

If we ignore range reduction then inserting vroundps instead of FP16
roundtrips would be cheaper, yes. Otherwise, I don't see how.

The two types float16_eval_as_float32_if_faster and
float16_with_guaranteed_precision_and_range are both useful but have different
applications. They should be two types rather than a compiler flag. But that's
true for all floating-point types and fast-math flags ... (I'm repeating
myself on this point.)
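
To illustrate the difference between emulating FP16 precision and emulating FP16 precision plus range, here is a minimal scalar sketch. It is not the vroundps sequence mentioned above and not what GCC emits. It is a plain integer bit trick, and the name round_to_fp16_precision is hypothetical. It assumes float is IEEE binary32 and rounds the significand to the 11 bits that binary16 provides (round-to-nearest-even), while deliberately leaving the exponent range untouched.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper, for illustration only: round the significand of a
// binary32 value to 11 bits (the precision of IEEE binary16) using
// round-to-nearest-even. The exponent range is NOT reduced, so binary16
// overflow and subnormal behaviour are not reproduced.
inline float round_to_fp16_precision(float x)
{
    if (!std::isfinite(x))
        return x;  // pass NaN and infinities through unchanged

    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);

    // binary32 stores 23 fraction bits, binary16 stores 10: drop the low 13.
    const std::uint32_t lsb = (bits >> 13) & 1u;  // lowest bit that is kept
    bits += 0x0FFFu + lsb;           // bias so that ties go to the even value
    bits &= ~std::uint32_t{0x1FFF};  // clear the 13 dropped bits

    std::memcpy(&x, &bits, sizeof x);
    return x;
}

inline void check_examples()
{
    // 1.0 is exactly representable with 11 significand bits.
    assert(round_to_fp16_precision(1.0f) == 1.0f);
    // 0.1f rounds to the value binary16 uses for 0.1.
    assert(round_to_fp16_precision(0.1f) == 0.0999755859375f);
    // 1e10 exceeds the largest finite binary16 value (65504). A real
    // conversion roundtrip would overflow, but this sketch keeps it finite.
    assert(std::isfinite(round_to_fp16_precision(1.0e10f)));
}
```

The last check is the point of the example: a precision-only emulation behaves differently from a real conversion roundtrip as soon as a value leaves the binary16 range, which is why the range reduction step matters for the cost comparison.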