https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86819

--- Comment #6 from Marc Glisse <glisse at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #5)
> Note that your code compares throughput. A microbenchmark for comparing
> latency would chain dependent computations, e.g. like this:

Ok, in that example the version with 2 divisions is about 7% faster on Skylake
(and -mrecip makes the code almost 40% slower...).
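
For reference, a minimal sketch of what a latency-bound comparison of this
kind could look like. This is not the benchmark from comment #5; the function
names and the recurrence are hypothetical and only illustrate the idea. The
divisor of each iteration depends on the previous results, so the reciprocal
cannot be hoisted out of the loop, and the reciprocal variant has one extra
multiplication on the critical path.

```c
#include <stddef.h>

/* Hypothetical latency-bound kernel: two divisions by the same divisor.
   The next divisor depends on both quotients, so iterations cannot overlap.
   Critical path per iteration: division + addition. */
double chain_two_div(double x, double y, double d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double q1 = x / d;
        double q2 = y / d;
        d = q1 + q2 + 1.0;      /* feeds the next iteration's divisions */
    }
    return d;
}

/* Same recurrence written as one reciprocal plus two multiplications.
   Critical path per iteration: division + multiplication + addition. */
double chain_recip_mul(double x, double y, double d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double r = 1.0 / d;     /* the single division */
        double q1 = x * r;
        double q2 = y * r;
        d = q1 + q2 + 1.0;      /* feeds the next iteration's reciprocal */
    }
    return d;
}
```

To compare the two, time each function over a large n and build without
-ffast-math or -freciprocal-math, so that the compiler does not rewrite the
first form into the second by itself. The results of the two functions may
differ in the last bits, since x * (1.0 / d) is not always equal to x / d.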

> > Maybe the right choice is clearer for double than for float? I would still
> > go with an unconditional 2, for simplicity.
> 
> Ack. I just want to point out that it's not so clear-cut given the trend for
> improved pipelining of division in the latest cpu generations.

Ok. For now, I would go with 2 at least for double (unless we have a way to
detect the rare cases where the latency hurts), and maybe revisit if the
pipelining of divisions keeps improving faster than the latency of
multiplication.
