------- Comment #16 from ubizjak at gmail dot com 2007-06-10 16:24 -------
(In reply to comment #13)
> > x1 = 0.5 X0 (3.0 - A x0 x0 x0)

Whoops! One x0 too many above. The correct calculation reads:

rsqrt = 0.5 rsqrt(a) (3.0 - a rsqrt(a) rsqrt(a))

> Well, I suppose it depends on the hardware. IIRC older cpu:s did division with
> microcode whereas at least core2 and K8 do it in hardware, so I guess the
> hundreds of cycles doesn't apply to current cpu:s.
>
> Also, supposedly Penryn will have a much improved divider..

Well, mubench says for my Core2Duo that _all_ sqrt and div functions have a
latency of 6 clocks and a reciprocal throughput of 5 clks. By _all_ I mean
divss, divps, divsd, divpd, sqrtss, sqrtps, sqrtsd and sqrtpd. OTOH, rsqrtss
and rcpss have a latency of 3 clks and a reciprocal throughput of 2 clks. This
is just amazing.

> That being said, I think there is still a case for the reciprocal square root,
> as evidenced by the benchmarks in #5 and #7 as well as my analysis of gas_dyn
> linked to in the first message in this PR (in short, ifort does sqrt(a/b) about
> twice as fast as gfortran by using reciprocal approximations + NR). If indeed
> div(p|s)s is about equally fast as rcp(p|s)s as your benchmarks show, then it
> suggests almost all the performance benefit ifort gets is due to the
> rsqrt(p|s)s, no? Or perhaps there is some issue with pipelining? In gas_dyn the
> sqrt(a/b) loop fills an array, whereas your benchmark accumulates..

It is true that my "benchmark" only measures a trivial accumulation function.
I can prepare a bunch of expanders to expand:

a / b         <=> a [rcpss(b) (2.0 - b rcpss(b))]
a / sqrtss(b) <=> a [0.5 rsqrtss(b) (3.0 - b rsqrtss(b) rsqrtss(b))]
sqrtss (a)    <=> a 0.5 rsqrtss(a) (3.0 - a rsqrtss(a) rsqrtss(a))

The second and third cases indeed look similar...

> I hear that it's possible to pass spec2k6/gromacs without the NR step. As most
> MD programs, gromacs spends almost all it's time in the force calculations,
> where the majority of time is spent calculating 1/sqrt(...). So perhaps one
> should watch out for compilers that get suspiciously high scores on that
> benchmark. :)

Yes, look at the hpcwire article in Comment #12.

> No, I'm not suggesting gcc should do this. ;))

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=31723
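
For illustration, the three expansions listed in this comment could be written by hand in C roughly as follows. This is only a sketch, not the proposed GCC expanders: the helper names (rcp_est, rsqrt_est, nr_div, nr_rsqrt, nr_div_sqrt, nr_sqrt) are made up for this example, the <xmmintrin.h> intrinsics stand in for the rcpss and rsqrtss instructions, and the non-SSE fallback merely fakes a roughly 12-bit estimate so the file still builds elsewhere.

```c
#include <math.h>

#if defined(__SSE__)
#include <xmmintrin.h>

/* Hardware estimates (rcpss / rsqrtss), roughly 12 bits of precision. */
static inline float rcp_est(float x)
{
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));
}

static inline float rsqrt_est(float x)
{
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}
#else
/* Assumption: no SSE here, so emulate a roughly 12-bit estimate by
   perturbing the exact result. For illustration only. */
static inline float rcp_est(float x)
{
    return (1.0f / x) * (1.0f + 1.0f / 4096.0f);
}

static inline float rsqrt_est(float x)
{
    return (1.0f / sqrtf(x)) * (1.0f + 1.0f / 4096.0f);
}
#endif

/* a / b  <=>  a [rcpss(b) (2.0 - b rcpss(b))] */
static inline float nr_div(float a, float b)
{
    float r = rcp_est(b);
    return a * (r * (2.0f - b * r));
}

/* 1 / sqrt(b)  <=>  0.5 rsqrtss(b) (3.0 - b rsqrtss(b) rsqrtss(b)) */
static inline float nr_rsqrt(float b)
{
    float r = rsqrt_est(b);
    return 0.5f * r * (3.0f - b * r * r);
}

/* a / sqrt(b)  <=>  a [0.5 rsqrtss(b) (3.0 - b rsqrtss(b) rsqrtss(b))] */
static inline float nr_div_sqrt(float a, float b)
{
    return a * nr_rsqrt(b);
}

/* sqrt(a)  <=>  a 0.5 rsqrtss(a) (3.0 - a rsqrtss(a) rsqrtss(a))
   Note: a == 0 yields 0 * inf = NaN, so a real expander would have to
   treat that input separately. */
static inline float nr_sqrt(float a)
{
    return a * nr_rsqrt(a);
}
```

One Newton-Raphson step roughly doubles the number of correct bits in the estimate, so nr_div(1.0f, 3.0f) and nr_sqrt(2.0f) should land close to, though not necessarily bit-identical with, the results of divss and sqrtss.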