https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86819

--- Comment #4 from Marc Glisse <glisse at gcc dot gnu.org> ---
(In reply to Alexander Monakov from comment #3)
> I think there may be realistic situations where the change can introduce a
> regression: while a win throughput-wise, it introduces one multiplication
> latency following division latency in the dependency chain, so if the
> original divisions were on the critical path, it grows longer.

But unless your FPU can do two divisions in parallel, you also have to take
into account the delay before a second division can start (the divider's
reciprocal throughput), which is often larger than the latency of a
multiplication.
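
As a back-of-envelope latency model (the cycle counts below are purely
illustrative, Skylake-ish assumptions on my part, not measurements):

/* Toy latency model for one call to f() below.  L_div, T_div and
   L_mul are placeholder, Skylake-ish cycle counts, not measured
   values. */
#include <stdio.h>

int main(void)
{
  double L_div = 14; /* latency of one double division             */
  double T_div = 4;  /* gap before the divider accepts the next
                        division (inverse throughput)              */
  double L_mul = 4;  /* latency of one multiplication              */

  /* a/c and b/c share a single divider: the second division can
     only start T_div cycles after the first, so the last quotient
     is ready at T_div + L_div. */
  printf("two divisions     : %g cycles\n", T_div + L_div);

  /* d = 1/c finishes at L_div, then a*d and b*d issue in parallel
     on the multipliers and finish L_mul cycles later. */
  printf("rcp + 2 multiplies: %g cycles\n", L_div + L_mul);

  /* So the rewrite lengthens the critical path only when
     L_mul > T_div. */
  return 0;
}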

To try your example:

__attribute__((noipa))
void g(double,double){}
__attribute__((noipa))
void f(double a,double b,double c){
#if 0
  /* rewritten form: one reciprocal, two multiplications */
  double d=1/c;
  g(a*d,b*d);
#else
  /* original form: two divisions by the same c */
  g(a/c,b/c);
#endif
}
int main(){
  for(int i=0;i<400000000;++i)f(3,7,11);
}
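
(Something along the lines of "gcc -O2 test.c && time ./a.out" is enough to
try this; pick whatever flags match your setup.)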

On Skylake, I am getting 1s for the two divisions and .75s for the
inverse + multiplications. With float, both are .75s.

Maybe the right choice is clearer for double than for float? I would still go
with an unconditional threshold of 2 divisions, for simplicity.
