[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

ubizjak at gmail dot com Sun, 10 Jun 2007 09:25:12 -0700


------- Comment #16 from ubizjak at gmail dot com  2007-06-10 16:24 -------
(In reply to comment #13)


> > x1 = 0.5 X0 (3.0 - A x0 x0 x0)

Whops! One x0 too much above. Correct calcualtion reads:

rsqrt = 0.5 rsqrt(a) (3.0 - a rsqrt(a) rsqrt(a)).

> Well, I suppose it depends on the hardware. IIRC older cpu:s did division with
> microcode whereas at least core2 and K8 do it in hardware, so I guess the
> hundreds of cycles doesn't apply to current cpu:s. 
> 
> Also, supposedly Penryn will have a much improved divider..

Well, mubench says for my Core2Duo that _all_ sqrt and div functions have
latency of 6 clocks and rcp throughput of 5 clks. By _all_ I mean divss, divps,
divsd, divpd, sqrtss, sqrtps, sqrtsd and sqrtpd. OTOH, rsqrtss and rcpss have
latency of 3 clks and rcp throughput of 2 clks. This is just amazing.

> That being said, I think there is still a case for the reciprocal square root,
> as evidenced by the benchmarks in #5 and #7 as well as my analysis of gas_dyn
> linked to in the first message in this PR (in short, ifort does sqrt(a/b) 
> about
> twice as fast as gfortran by using reciprocal approximations + NR). If indeed
> div(p|s)s is about equally fast as rcp(p|s)s as your benchmarks show, then it
> suggests almost all the performance benefit ifort gets is due to the
> rsqrt(p|s)s, no? Or perhaps there is some issue with pipelining? In gas_dyn 
> the
> sqrt(a/b) loop fills an array, whereas your benchmark accumulates..

It is true, that only a trivial accumulation function is benchmarked by my
"benchmark". I can prepare a bunch of expanders to expand:

a / b <=> a [rcpss(b) (2.0 - b rcpss(b))]

a / sqrtss(b) <=> a [0.5 rsqrtss(b) (3.0 - b rsqrtss(b) rsqrtss(b))].

sqrtss (a) <=> a 0.5 rsqrtss(a) (3.0 - a rsqrtss(a) rsqrtss(a))

second and third case indeed look similar...

> I hear that it's possible to pass spec2k6/gromacs without the NR step. As most
> MD programs, gromacs spends almost all it's time in the force calculations,
> where the majority of time is spent calculating 1/sqrt(...). So perhaps one
> should watch out for compilers that get suspiciously high scores on that
> benchmark. :)

Yes, look at hpcwire article in Comment #12

> No, I'm not suggesting gcc should do this.

;))


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723

[Bug middle-end/31723] Use reciprocal and reciprocal square root with -ffast-math

Reply via email to