https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86819
--- Comment #4 from Marc Glisse <glisse at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #3)
> I think there may be realistic situations where the change can introduce a
> regression: while a win throughput-wise, it introduces one multiplication
> latency following division latency in the dependency chain, so if the
> original divisions were on the critical path, it grows longer.
But unless your FPU can execute two divisions in parallel, you also have to
take into account the delay before a second division can start (the divider's
reciprocal throughput), which is often larger than the latency of a
multiplication.
To try your example:
__attribute__((noipa))
void g(double, double) {}
__attribute__((noipa))
void f(double a, double b, double c) {
#if 0
  /* rewritten form: one reciprocal, two multiplications */
  double d = 1 / c;
  g(a * d, b * d);
#else
  /* original form: two divisions */
  g(a / c, b / c);
#endif
}
int main() {
  for (int i = 0; i < 400000000; ++i)
    f(3, 7, 11);
}
On Skylake, I am getting 1s for the 2 divisions and .75s for the
reciprocal + multiplications. With float, both take .75s.
Maybe the right choice is clearer for double than for float? I would still go
with an unconditional threshold of 2 divisions, for simplicity.