https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86819
--- Comment #5 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Marc Glisse from comment #4)
> But unless your FPU can do 2 divisions in parallel, you have to take into
> account the delay before a second division can start (related to
> throughput), which is often larger than the latency of a multiplication.

Yep - Agner's tables indicate that starting with Ivy Bridge, divss is
partially pipelined, and on Skylake-X its reciprocal throughput is just
3 cycles, which is lower than mulss latency (4 cycles). On Ryzen it's similar.

> To try your example: [snip]
> On skylake, I am getting 1s for the 2 divisions and .75s for the
> inverse+multiplication. With float, both are .75s.

Note that your code compares throughput. A microbenchmark for comparing
latency would chain dependent computations, e.g. like this:

int main()
{
    float a = 3, b = 7;
    for (int i = 0; i < 100000000; ++i) {
        float c = a + b;
        float d = 1 / c;
#if 0
        a /= c;          /* two dependent divisions ...              */
        b /= c;
#else
        a *= d;          /* ... vs. one division plus two multiplies */
        b *= d;
#endif
    }
    __builtin_printf("%g %g\n", a, b);
}

> Maybe the right choice is clearer for double than for float? I would still
> go with an unconditional 2, for simplicity.

Ack. I just want to point out that it's not so clear-cut given the trend
toward improved pipelining of division in the latest CPU generations.
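
For contrast with the latency microbenchmark above, here is a minimal sketch
of what a throughput-style measurement looks like (this is not the snipped
code from comment #4, just an illustration of the idea): the divisions take
fresh inputs every iteration, so successive divisions do not depend on earlier
results and can overlap in the divider's pipeline.

/* Illustrative throughput-style microbenchmark (not the code from comment #4,
   which was snipped). No cross-iteration FP dependence, so loop time tracks
   the divider's reciprocal throughput rather than its latency. */
int main()
{
    float a = 3, b = 7;
    volatile float out1, out2;   /* volatile keeps the results live */
    for (int i = 0; i < 100000000; ++i) {
        float c = i + 1.0f;      /* fresh input each iteration */
#if 0
        out1 = a / c;            /* two independent divisions ...            */
        out2 = b / c;
#else
        float d = 1 / c;         /* ... vs. one division plus two multiplies */
        out1 = a * d;
        out2 = b * d;
#endif
    }
    __builtin_printf("%g %g\n", out1, out2);
}

With the chained version earlier in the comment the loop time is dominated by
latency; with this one it is dominated by throughput, which is the quantity
the numbers in comment #4 reflect.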