https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86819
--- Comment #6 from Marc Glisse <glisse at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #5)
> Note that your code compares throughput. A microbenchmark for comparing
> latency would chain dependent computations, e.g. like this:

Ok, the 2 divisions manage to be about 7% faster in that example on skylake
(and -mrecip makes the code almost 40% slower...).

> > Maybe the right choice is clearer for double than for float? I would still
> > go with an unconditional 2, for simplicity.
>
> Ack. I just want to point out that it's not so clear-cut given the trend for
> improved pipelining of division in the latest cpu generations.

Ok. For now, I would go with 2 at least for double (unless we have a way to
detect the rare cases where the latency hurts), and maybe revisit if the
pipelining of divisions keeps improving faster than the latency of
multiplication.
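
For readers following the thread, here is a minimal sketch of what a latency-bound comparison could look like. It is not the code from comment #5, which is not quoted here. It assumes the two variants being compared are "two divisions by the same divisor" and "one reciprocal followed by two multiplications"; the function names are hypothetical. Each iteration's divisor depends on the previous iteration's results, so the loop measures latency rather than throughput.

```c
#include <stddef.h>

/* Hypothetical latency microbenchmark kernels (illustrative only).
 * The next divisor d is computed from both quotients, so iteration
 * i+1 cannot start its divisions before iteration i has finished. */

/* Variant A: two divisions by the same divisor per iteration. */
double chain_two_divisions(double x, double y, double d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double p = x / d;
        double q = y / d;
        d = p + q;          /* loop-carried dependency on both results */
    }
    return d;
}

/* Variant B: one division to form the reciprocal, then two
 * multiplications. This is the shape a reciprocal transformation
 * would produce; here it is written out by hand. */
double chain_reciprocal(double x, double y, double d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double r = 1.0 / d;
        double p = x * r;
        double q = y * r;
        d = p + q;          /* same loop-carried dependency */
    }
    return d;
}
```

To compare the two, one would time each function for a large `n` at the same optimization level, without flags that let the compiler rewrite variant A into variant B on its own. The inputs `x = 1.0`, `y = 3.0`, `d = 2.0` keep the divisor at a fixed point of 2.0, so the values stay bounded and exactly representable.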