RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt) estimation in -ffast-math

Kumar, Venkataramanan Mon, 20 Jul 2015 00:54:08 -0700

Hi, 

I missed your email and noticed it this week.


What does column 2 test?  Are you trying to implement square root using the
reciprocal estimate and step instructions?

Still, reciprocal square root using the reciprocal estimate and steps
(2 for SP, 3 for DP) seems to be better than using fdiv and fsqrt in your case.
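
For reference, the series under discussion can be sketched in portable C
roughly as below.  This is only a sketch under stated assumptions, not the
code from the patch: rsqrt_estimate and rsqrt_step are hypothetical stand-ins
for FRSQRTE and FRSQRTS, and the initial estimate is assumed to be good to
only about 8 bits, so that each step roughly doubles the number of correct
bits.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical stand-in for FRSQRTE: estimate 1/sqrt(d) to about 8 bits.
   Assumption: real hardware uses a lookup table; this search only mimics
   the limited precision of the result.  d must be finite and positive. */
double rsqrt_estimate(double d)
{
    double scale = 1.0;
    int m = 128;

    assert(d > 0.0 && d <= DBL_MAX);
    /* Bring d into [1, 4); 1/sqrt(d) then scales by a power of two. */
    while (d >= 4.0) { d /= 4.0; scale /= 2.0; }
    while (d < 1.0)  { d *= 4.0; scale *= 2.0; }
    /* Largest m/256 in [0.5, 1] whose square times d does not exceed 1. */
    while (m < 256 && (m + 1) * (m + 1) * d <= 65536.0)
        m++;
    return scale * m / 256.0;
}

/* Hypothetical stand-in for FRSQRTS: (3 - a*b) / 2. */
double rsqrt_step(double a, double b)
{
    return (3.0 - a * b) / 2.0;
}

/* 1/sqrt(d) from the estimate plus `steps` Newton-Raphson refinements;
   each refinement costs two multiplies and one step operation. */
double rsqrt_series(double d, int steps)
{
    double x = rsqrt_estimate(d);
    for (int i = 0; i < steps; i++)
        x = x * rsqrt_step(d, x * x);
    return x;
}

/* |d*x*x - 1|, about twice the relative error of x; zero when exact. */
double rsqrt_residual(double d, int steps)
{
    double x = rsqrt_series(d, steps);
    double r = d * x * x - 1.0;
    return r < 0.0 ? -r : r;
}
```

With d = 3.0, rsqrt_residual(d, 2) comes out small enough for single
precision (24 bits) but not for double precision (53 bits), while a third
step brings it down to rounding level.  Under the 8-bit assumption above,
that is where the 2-step/3-step split comes from.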

Regards,
Venkat.

> -----Original Message-----
> From: Evandro Menezes [mailto:[email protected]]
> Sent: Wednesday, July 15, 2015 3:45 AM
> To: Kumar, Venkataramanan; [email protected]; 'Dr. Philipp Tomsich'
> Cc: 'James Greenhalgh'; 'Benedikt Huber'; [email protected]; 'Marcus
> Shawcroft'; 'Ramana Radhakrishnan'; 'Richard Earnshaw'
> Subject: RE: [PATCH] [aarch64] Implemented reciprocal square root (rsqrt)
> estimation in -ffast-math
> 
> I ran a simple test on A57 rev. 0, looping a million times around sqrt{,f} and
> the respective series iterations with the values in the sequence 1..1000000
> and got these results:
> 
> sqrt(x):        36593844/s      1/sqrt(x):      18283875/s
> 3 Steps:        47922557/s      3 Steps:        49005194/s
> 
> sqrtf(x):       143988480/s     1/sqrtf(x):     69516857/s
> 2 Steps:        78740157/s      2 Steps:        80385852/s
> 
> I'm a bit surprised that the 3-iteration series for DP is faster than sqrt(),
> but not that it's much faster for the reciprocal of sqrt().  As for SP, the
> 2-iteration series is faster only for the reciprocal of sqrtf().
> 
> There might still be some legs for this patch in real-world cases which I'd
> like to investigate.
> 
> --
> Evandro Menezes                              Austin, TX
> 
> 
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Kumar,
> > Venkataramanan
> > Sent: Monday, June 29, 2015 13:50
> > To: [email protected]; Dr. Philipp Tomsich
> > Cc: James Greenhalgh; Benedikt Huber; [email protected]; Marcus
> > Shawcroft; Ramana Radhakrishnan; Richard Earnshaw
> > Subject: RE: [PATCH] [aarch64] Implemented reciprocal square root
> > (rsqrt) estimation in -ffast-math
> >
> > Hi,
> >
> > > -----Original Message-----
> > > From: [email protected] [mailto:[email protected]]
> > > Sent: Monday, June 29, 2015 10:23 PM
> > > To: Dr. Philipp Tomsich
> > > Cc: James Greenhalgh; Kumar, Venkataramanan; Benedikt Huber;
> > > gcc-[email protected]; Marcus Shawcroft; Ramana Radhakrishnan;
> > > Richard Earnshaw
> > > Subject: Re: [PATCH] [aarch64] Implemented reciprocal square root
> > > (rsqrt) estimation in -ffast-math
> > >
> > >
> > >
> > >
> > >
> > > > On Jun 29, 2015, at 4:44 AM, Dr. Philipp Tomsich <[email protected]> wrote:
> > > >
> > > > James,
> > > >
> > > >> On 29 Jun 2015, at 13:36, James Greenhalgh <[email protected]> wrote:
> > > >>
> > > >>> On Mon, Jun 29, 2015 at 10:18:23AM +0100, Kumar, Venkataramanan wrote:
> > > >>>
> > > >>>> -----Original Message-----
> > > >>>> From: Dr. Philipp Tomsich
> > > >>>> [mailto:[email protected]]
> > > >>>> Sent: Monday, June 29, 2015 2:17 PM
> > > >>>> To: Kumar, Venkataramanan
> > > >>>> Cc: [email protected]; Benedikt Huber; [email protected]
> > > >>>> Subject: Re: [PATCH] [aarch64] Implemented reciprocal square
> > > >>>> root (rsqrt) estimation in -ffast-math
> > > >>>>
> > > >>>> Kumar,
> > > >>>>
> > > >>>> This is not unexpected, as the initial estimation and
> > > >>>> each iteration will add an architecturally-defined number of
> > > >>>> bits of precision (ARMv8 guarantees only a minimum number of
> > > >>>> bits provided per operation… the exact number is specific to
> > > >>>> each micro-arch, though).
> > > >>>> Depending on your architecture and on the number of precise
> > > >>>> bits required by any given benchmark, one may see miscompares.
> > > >>>
> > > >>> True.
> > > >>
> > > >> I would be very uncomfortable with this approach.
> > > >
> > > > Same here. The default must be safe. Always.
> > > > Unlike other architectures, we don’t have a problem with making
> > > > the proper defaults for “safety”, as the ARMv8 ISA guarantees a
> > > > minimum number of precise bits per iteration.
> > > >
> > > >> From Richard Biener's post in the thread Michael Matz linked
> > > >> earlier in the thread:
> > > >>
> > > >>   It would follow existing practice of things we allow in
> > > >>   -funsafe-math-optimizations.  Existing practice in that we
> > > >>   want to allow -ffast-math use with common benchmarks we care
> > > >>   about.
> > > >>
> > > >>   https://gcc.gnu.org/ml/gcc-patches/2009-11/msg00100.html
> > > >>
> > > >> With the solution you seem to be converging on (2-steps for some
> > > >> microarchitectures, 3 for others), a binary generated for one
> > > >> micro-arch may drop below a minimum guarantee of precision when
> > > >> run on another. This seems to go against the spirit of the
> > > >> practice above. I would only support adding this optimization to
> > > >> -Ofast if we could keep to architectural guarantees of precision
> > > >> in the generated code (i.e. 3-steps everywhere).
> > > >>
> > > >> I don't object to adding a "-mlow-precision-recip-sqrt" style
> > > >> option, which would be off by default, would enable the 2-step
> > > >> mode, and would need to be explicitly enabled (i.e. not implied
> > > >> by -mcpu=foo) but I don't see what this buys you beyond the Gromacs
> > > >> boost (and even there you would be creating an Invalid Run as
> > > >> optimization flags must be applied across all workloads).
> > > >
> > > > Any flag that reduces precision (and thus breaks IEEE
> > > > floating-point semantics) needs to be gated with an “unsafe”
> > > > flag (i.e. one that is never on by default).
> > > > As a consequence, the “peak”-tuning for SPEC will turn this on…
> > > > but barely anyone else would.
> > > >
> > > >> For the 3-step optimization, it is clear to me that for "generic"
> > > >> tuning we don't want this to be enabled by default; experimental
> > > >> results and advice in this thread argue against it for thunderx
> > > >> and cortex-a57 targets.
> > > >> However, enabling it based on the CPU tuning selected seems fine
> > > >> to me.
> > > >
> > > > I do not agree on this one, as I would like to see the safe form (i.e.
> > > > 3 and 5 iterations respectively) become the default. Most
> > > > “server-type” chips should not see a performance regression, while
> > > > it will be easier to optimise for this in hardware than for a
> > > > (potentially microcoded) sqrt-instruction (and subsequent,
> > > > dependent divide).
> > > >
> > > > I have not heard anyone claim a performance regression (either on
> > > > thunderx or on cortex-a57), but merely heard a “no speed-up”.
> > >
> > > Actually it does regress performance on thunderX; I just assumed
> > > that when I said it was not going to be a win it was taken as a
> > > slowdown. It regresses gromacs by more than 10% on thunderX but I
> > > can't remember how much as I had someone else run it. The latency
> > > difference is also over 40%; for example single precision: 29 cycles
> > > with div (12) sqrt(17) directly vs 42 cycles with the rsqrte and 2
> > > iterations of 2mul/rsqrts (double is 53 vs 60). That is a huge
> > > difference right there.  ThunderX has a fast div and a fast sqrt for
> > > 32bit and a reasonable one for double.  So again this is not just not
> > > a win but rather a regression for thunderX. I suspect the same is
> > > true for cortex-a57.
> > >
> > > Thanks,
> > > Andrew
> > >
> >
> > Yes, theoretically that should be true for the cortex-a57 case as well.
> > But I believe hardware pipelining with instruction scheduling in the
> > compiler helps a little for the gromacs case: ~3% to 4% with the
> > original patch.
> >
> > I have not tested other FP benchmarks.  As James said, a flag
> > -mlow-precision-recip-sqrt, if allowed, can be used as a peak flag.
> >
> > > >
> > > > So I am strongly in favor of defaulting to the ‘safe’ number of
> > > > iterations, even when compiling for a generic target.
> > > >
> > > > Best,
> > > > Philipp.
> > > >
> >
> > Regards,
> > Venkat.
