https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86819
--- Comment #4 from Marc Glisse <glisse at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #3)
> I think there may be realistic situations where the change can introduce a
> regression: while a win throughput-wise, it introduces one multiplication
> latency following division latency in the dependency chain, so if the
> original divisions were on the critical path, it grows longer.

But unless your FPU can do 2 divisions in parallel, you have to take into
account the delay before a second division can start (related to throughput),
which is often larger than the latency of a multiplication.

To try your example:

__attribute__((noipa)) void g(double,double){}
__attribute__((noipa)) void f(double a,double b,double c){
#if 0
  double d=1/c;
  g(a*d,b*d);
#else
  g(a/c,b/c);
#endif
}
int main(){
  for(int i=0;i<400000000;++i)f(3,7,11);
}

On Skylake, I am getting 1s for the 2 divisions and .75s for the
inverse+multiplication. With float, both are .75s. Maybe the right choice is
clearer for double than for float? I would still go with an unconditional 2,
for simplicity.
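
For intuition, here is a minimal back-of-the-envelope sketch of the
dependency-chain argument above. The cycle counts are rough Skylake-like
assumptions (double-division latency around 14 cycles, roughly 4 cycles
before the divider accepts the next division, multiply latency around 4
cycles), not measurements, and they vary by microarchitecture:

#include <stdio.h>

int main(void) {
    /* Assumed, approximate per-instruction costs in cycles. */
    const int div_latency = 14;   /* result of one double division       */
    const int div_interval = 4;   /* gap before a 2nd division can start */
    const int mul_latency = 4;    /* result of one double multiplication */

    /* a/c and b/c: the second division cannot issue until the divider
       frees up, so its result arrives div_interval cycles later.  */
    int two_divs = div_interval + div_latency;

    /* d = 1/c; a*d, b*d: both multiplies start once d is ready and can
       run in parallel, so only one multiply latency is added.  */
    int recip_mul = div_latency + mul_latency;

    printf("two divisions     : ~%d cycles\n", two_divs);
    printf("inverse + multiply: ~%d cycles\n", recip_mul);
    return 0;
}

Under these assumed numbers the two forms end up with a comparable critical
path, while the reciprocal form occupies the divider only once, which matches
the throughput difference seen in the benchmark above.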