For both FRECPE and FRSQRTE, the ARMv8 ISA guide states in the pseudo-code:

"Result is double-precision and a multiple of 1/256 in the range 1 to 511/256."

This suggests that the estimate is merely 8 bits long. IIRC, x86 returns 12 bits for its equivalent instructions, so a single series iteration is then enough for both SP and DP to achieve a precise enough result.

--
Evandro Menezes Austin, TX

> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Dr. Philipp Tomsich
> Sent: Monday, June 29, 2015 3:47
> To: Kumar, Venkataramanan
> Cc: [email protected]; Benedikt Huber; [email protected]
> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
>
> Kumar,
>
> This does not come unexpected, as the initial estimation and each iteration
> will add an architecturally-defined number of bits of precision (ARMv8
> guarantees only a minimum number of bits provided per operation… the exact
> number is specific to each micro-arch, though).
> Depending on your architecture and on the required number of precise bits by
> any given benchmark, one may see miscompares.
>
> Do you know the exact number of bits that the initial estimate and the
> subsequent refinement steps add for your micro-arch?
>
> Thanks,
> Philipp.
>
> > On 29 Jun 2015, at 10:17, Kumar, Venkataramanan
> > <[email protected]> wrote:
> >
> > Hmm, reducing the iterations to "1 step for float" and "2 steps for
> > double"
> >
> > I got VE (miscompares) on the following benchmarks:
> > 416.gamess
> > 453.povray
> > 454.calculix
> > 459.GemsFDTD
> >
> > Benedikt, I have ICE for 444.namd with your patch, not sure if something
> > is wrong in my local tree.
> >
> > Regards,
> > Venkat.
> >
> >> -----Original Message-----
> >> From: [email protected] [mailto:[email protected]]
> >> Sent: Sunday, June 28, 2015 8:35 PM
> >> To: Kumar, Venkataramanan
> >> Cc: Dr. Philipp Tomsich; Benedikt Huber; [email protected]
> >> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >> (rsqrt) estimation in -ffast-math
> >>
> >>> On Jun 25, 2015, at 9:44 AM, Kumar, Venkataramanan
> >>> <[email protected]> wrote:
> >>>
> >>> I got around ~12% gain with -Ofast -mcpu=cortex-a57.
> >>
> >> I get around 11/12% on thunderX with the patch and the change decreasing
> >> the iterations (1/2) compared to without the patch.
> >>
> >> Thanks,
> >> Andrew
> >>
> >>>
> >>> Regards,
> >>> Venkat.
> >>>
> >>>> -----Original Message-----
> >>>> From: [email protected] [mailto:gcc-patches-[email protected]]
> >>>> On Behalf Of Dr. Philipp Tomsich
> >>>> Sent: Thursday, June 25, 2015 9:13 PM
> >>>> To: Kumar, Venkataramanan
> >>>> Cc: Benedikt Huber; [email protected]; [email protected]
> >>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >>>> (rsqrt) estimation in -ffast-math
> >>>>
> >>>> Kumar,
> >>>>
> >>>> what is the relative gain that you see on Cortex-A57?
> >>>>
> >>>> Thanks,
> >>>> Philipp.
> >>>>
> >>>>> On 25 Jun 2015, at 17:35, Kumar, Venkataramanan
> >>>>> <[email protected]> wrote:
> >>>>>
> >>>>> Changing to "1 step for float" and "2 steps for double" gives
> >>>>> better gains now for gromacs on cortex-a57.
> >>>>>
> >>>>> Regards,
> >>>>> Venkat.
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: [email protected] [mailto:gcc-patches-[email protected]]
> >>>>>> On Behalf Of Benedikt Huber
> >>>>>> Sent: Thursday, June 25, 2015 4:09 PM
> >>>>>> To: [email protected]
> >>>>>> Cc: [email protected]; philipp.tomsich@theobroma-systems.com
> >>>>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> >>>>>> (rsqrt) estimation in -ffast-math
> >>>>>>
> >>>>>> Andrew,
> >>>>>>
> >>>>>>> This is NOT a win on thunderX at least for single precision
> >>>>>>> because you have to do the divide and sqrt in the same time as
> >>>>>>> it takes 5 multiplies (estimate and step are multiplies in the
> >>>>>>> thunderX pipeline). Doubles is 10 multiplies which is just the
> >>>>>>> same as what the patch does (but it is really slightly less than
> >>>>>>> 10, I rounded up). So in the end this is NOT a win at all for
> >>>>>>> thunderX unless we do one less step for both single and double.
> >>>>>>
> >>>>>> Yes, the expected benefit from rsqrt estimation is implementation
> >>>>>> specific. If one has a better initial rsqrte or an application
> >>>>>> that can trade precision for execution time, we could offer a
> >>>>>> command line option to do only 2 steps for double and 1 step for
> >>>>>> float; similar to -mrecip-precision for PowerPC.
> >>>>>> What are your thoughts on that?
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Benedikt
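
To make the precision argument in this thread concrete, here is a rough Python sketch of how the number of accurate bits grows with each Newton-Raphson step for 1/sqrt(x). It is a model under stated assumptions, not what the patch emits: quantized_rsqrt_estimate is a hypothetical stand-in that simply rounds the exact value to a multiple of 2^-frac_bits, whereas the real FRSQRTE result is defined by the architecture's pseudo-code, and all helper names are made up for illustration. The refinement step is the textbook y * (3 - x*y*y) / 2.

```python
import math


def quantized_rsqrt_estimate(x, frac_bits=8):
    """Hypothetical model of an rsqrt estimate instruction.

    Assumption: the estimate is the exact 1/sqrt(x) rounded to a multiple
    of 2**-frac_bits. Real hardware uses its own estimate, so this only
    approximates the precision of the starting value.
    """
    scale = 1 << frac_bits
    return round(scale / math.sqrt(x)) / scale


def newton_rsqrt_step(x, y):
    """One Newton-Raphson refinement of y towards 1/sqrt(x)."""
    return y * (3.0 - x * y * y) / 2.0


def accurate_bits(x, y):
    """Accurate bits in y, measured as -log2 of the relative error.

    Capped at 53, the significand width of the doubles used here.
    """
    rel_err = abs(y * math.sqrt(x) - 1.0)
    if rel_err == 0.0:
        return 53.0
    return min(53.0, -math.log2(rel_err))


def bits_after_steps(x, frac_bits, steps):
    """Accurate bits after an initial estimate plus `steps` refinements."""
    y = quantized_rsqrt_estimate(x, frac_bits)
    for _ in range(steps):
        y = newton_rsqrt_step(x, y)
    return accurate_bits(x, y)


if __name__ == "__main__":
    x = 0.3  # arbitrary sample input
    for frac_bits in (8, 12):
        for steps in range(4):
            bits = bits_after_steps(x, frac_bits, steps)
            print(f"estimate bits={frac_bits} steps={steps}: "
                  f"~{bits:.1f} accurate bits")
```

Each step squares the relative error (times a factor of about 1.5), so the bit count roughly doubles, less a fraction of a bit. Starting from about 8 bits, one step stays short of the 24-bit single-precision significand and two steps stay short of the 53 bits of double precision, which may explain the miscompares reported above; a 12-bit starting estimate gets close to 24 bits after a single step.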
