On Mon, Feb 15, 2016 at 11:24:53AM -0600, Evandro Menezes wrote: > On 02/15/16 04:50, James Greenhalgh wrote: > >On Mon, Feb 08, 2016 at 10:57:10AM +0000, James Greenhalgh wrote: > >>On Mon, Feb 01, 2016 at 02:00:01PM +0000, James Greenhalgh wrote: > >>>On Mon, Jan 25, 2016 at 11:20:46AM +0000, James Greenhalgh wrote: > >>>>On Mon, Jan 11, 2016 at 12:04:43PM +0000, James Greenhalgh wrote: > >>>>>Hi, > >>>>> > >>>>>I've seen a couple of large performance issues caused by expanding > >>>>>the high-precision reciprocal square root for Cortex-A57, so I'd like > >>>>>to turn it off by default. > >>>>> > >>>>>This is good for art (~2%) from Spec2000, bad (~3.5%) for fma3d from > >>>>>Spec2000, good (~5.5%) for gromcas from Spec2006, and very good (>10%) > >>>>>for > >>>>>some private microbenchmark kernels which stress the divide/sqrt/multiply > >>>>>units. It therefore seems to me to be the correct choice to make across > >>>>>a number of workloads. > >>>>> > >>>>>Bootstrapped and tested on aarch64-none-linux-gnu with no issues. > >>>>> > >>>>>OK? > >>>>*Ping* > >>>*pingx2* > >>*ping^3* > >*ping^4* > > > >Thanks, > >James > > > >>>>>--- > >>>>>2015-12-11 James Greenhalgh <james.greenha...@arm.com> > >>>>> > >>>>> * config/aarch64/aarch64.c (cortexa57_tunings): Remove > >>>>> AARCH64_EXTRA_TUNE_RECIP_SQRT. > >>>>> > >>>>>diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > >>>>>index 1d5d898..999c9fc 100644 > >>>>>--- a/gcc/config/aarch64/aarch64.c > >>>>>+++ b/gcc/config/aarch64/aarch64.c > >>>>>@@ -484,8 +484,7 @@ static const struct tune_params cortexa57_tunings = > >>>>> 0, /* max_case_values. */ > >>>>> 0, /* cache_line_size. */ > >>>>> tune_params::AUTOPREFETCHER_WEAK, /* autoprefetcher_model. */ > >>>>>- (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS > >>>>>- | AARCH64_EXTRA_TUNE_RECIP_SQRT) /* tune_flags. */ > >>>>>+ (AARCH64_EXTRA_TUNE_RENAME_FMA_REGS) /* tune_flags. */ > >>>>> }; > >>>>> static const struct tune_params cortexa72_tunings = > > > > James, > > There seem to be SPEC CPU2000fp validation issues on A57 when this > flag is present too. Though I evaluated the algorithm with a huge > random set of values, always delivering accuracy around 1ulp, which > should be enough for CPU2000fp (wit x86-64), I expected the > benchmarks to pass. > > My suspicion is that the Newton series on AArch64 is probably good > only for SP. Then, DP might require an extra round, probably > exacerbating the performance penalty. > > I'd like to try to split this tuning option into one for SP and > another for DP. Thoughts?
I haven't seen validation issues with the default expansion, but with -mlow-precision-recip-sqrt I do see failures. I think this is to be expected. I don't support splitting the low-precision flag to "-mlow-precision-float-recip-sqrt" and "-mlow-precision-double-recip-sqrt", I think that is pushing a particular set of Spec tuning flags over any meaningful use case. I could imagine a case for splitting the internal tuning flag to give AARCH64_EXTRA_TUNE_SF_RECIP_SQRT and AARCH64_EXTRA_TUNE_DF_RECIP_SQRT, but I'm not sure I understand the benefits of this? Certainly, I think your goals for performance (turn on for 64-bit divide/sqrt) would contradict your goals for accuracy (turn off for 64-bit divide/sqrt). I'm happy with these flags as they are, but I might be missing a subtelty in your argument? Thanks, James