> -----Original Message----- > From: Kyrylo Tkachov <ktkac...@nvidia.com> > Sent: Monday, August 12, 2024 3:54 PM > To: Tamar Christina <tamar.christ...@arm.com> > Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Richard Sandiford > <richard.sandif...@arm.com> > Subject: Re: [PATCH][RFC] aarch64: Reduce FP reassociation width for Neoverse > V2 and set AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA > > Hi Tamar, > > > On 12 Aug 2024, at 16:48, Tamar Christina <tamar.christ...@arm.com> wrote: > > > > External email: Use caution opening links or attachments > > > > > > Hi Kyrill, > > > >> -----Original Message----- > >> From: Kyrylo Tkachov <ktkac...@nvidia.com> > >> Sent: Monday, August 12, 2024 3:07 PM > >> To: GCC Patches <gcc-patches@gcc.gnu.org> > >> Cc: Tamar Christina <tamar.christ...@arm.com>; Richard Sandiford > >> <richard.sandif...@arm.com> > >> Subject: [PATCH][RFC] aarch64: Reduce FP reassociation width for Neoverse > >> V2 > >> and set AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA > >> > >> Hi all, > >> > >> The fp reassociation width for Neoverse V2 was set to 6 since its > >> introduction and I guess it was empirically tuned. But since > >> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA was added the tree > reassociation > >> pass seems to be more deliberate in forming FMAs and when that flag is > >> used it seems to more properly evaluate the FMA vs non-FMA reassociation > >> widths. > > > > Thanks!, evaluating this flag has been on our list for a long while now.. > > > >> According to the Neoverse V2 SWOG the core has a throughput of 4 for > >> most FP operations, so the value 6 is not accurate from first principles.. > >> Also, the SWOG does state that FMADD operations are pipelined and the > >> results can be forwarded from FP multiplies to the accumulation operands > >> of FMADD instructions, which seems to be what > >> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA expresses. > > > > Yeah, We've always overestimated the fp reassociation width I think because > > of > the > > FMA forming pass. > > > > My understanding was that the pass has historically been a bit unreliable > > and > that > > the overestimation was to cover up this. But over the years a lot have > > changed. > > > > So I don't think historically it was based on the actual throughput on the > > core. > > e.g. Neoverse N1 and Cortex-A57 have a width of 4 set, even though they have > > a two FP units. > > > > That said from first principals I agree with the change, but I'm curious as > > to > whether you've > > also tested it just with AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA. i.e I'm > wondering > > if the performance improvements is coming from this flag alone. > > It did cross my mind. It’s been a couple weeks since I’ve done the > benchmarking > but I remember that neither of the changes alone gave the improvement, only > combined together. > AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA by itself won’t trigger due to the > logic fixed by https://gcc.gnu.org/PR116139 (it ICEd before). > Reducing the fp_reassoc_width didn’t give any difference either.
That's fair, from reading of the code in get_reassociation_width and how it's used I agree that setting it closer to the throughput makes sense. So there's no objections from me. I will test the other cores (new and old) if Richard S agrees. Thanks, Tamar > Thanks, > Kyrill > > > > > > Kind Regards, > > Tamar > > > >> > >> This patch sets the fp_reassoc_width field to 4 and enables > >> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA for -mcpu=neoverse-v2. > >> > >> On SPEC2017 fprate I see the following changes on a Grace system: > >> 503.bwaves_r 0.16% > >> 507.cactuBSSN_r -0.32% > >> 508.namd_r 3.04% > >> 510.parest_r 0.00% > >> 511.povray_r 0.78% > >> 519.lbm_r 0.35% > >> 521.wrf_r 0.69% > >> 526.blender_r -0.53% > >> 527.cam4_r 0.84% > >> 538.imagick_r 0.00% > >> 544.nab_r -0.97% > >> 549.fotonik3d_r -0.45% > >> 554.roms_r 0.97% > >> Geomean 0.35% > >> > >> with -Ofast -mcpu=grace -flto. > >> > >> So slight overall improvement with a meaningful improvement in > >> 508.namd_r. > >> > >> I think other tunings in aarch64 should look into > >> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA as well, but I'll leave the > >> benchmarking to someone else. > >> > >> Tamar, Richard, does the reasoning above make sense to you? > >> I know that FMA reassociation is something we’ve gone back and forth on in > the > >> backend… > >> > >> Thanks, > >> Kyrill > >> > >> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com> > >> > >> gcc/ChangeLog: > >> > >> * config/aarch64/tuning_models/neoversev2.h (fp_reassoc_width): > >> Set to 4. > >> (tune_flags): Add AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA. > >