> On 12 Aug 2024, at 17:24, Tamar Christina <tamar.christ...@arm.com> wrote: > > External email: Use caution opening links or attachments > > >> -----Original Message----- >> From: Kyrylo Tkachov <ktkac...@nvidia.com> >> Sent: Monday, August 12, 2024 3:54 PM >> To: Tamar Christina <tamar.christ...@arm.com> >> Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Richard Sandiford >> <richard.sandif...@arm.com> >> Subject: Re: [PATCH][RFC] aarch64: Reduce FP reassociation width for Neoverse >> V2 and set AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA >> >> Hi Tamar, >> >>> On 12 Aug 2024, at 16:48, Tamar Christina <tamar.christ...@arm.com> wrote: >>> >>> External email: Use caution opening links or attachments >>> >>> >>> Hi Kyrill, >>> >>>> -----Original Message----- >>>> From: Kyrylo Tkachov <ktkac...@nvidia.com> >>>> Sent: Monday, August 12, 2024 3:07 PM >>>> To: GCC Patches <gcc-patches@gcc.gnu.org> >>>> Cc: Tamar Christina <tamar.christ...@arm.com>; Richard Sandiford >>>> <richard.sandif...@arm.com> >>>> Subject: [PATCH][RFC] aarch64: Reduce FP reassociation width for Neoverse >>>> V2 >>>> and set AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA >>>> >>>> Hi all, >>>> >>>> The fp reassociation width for Neoverse V2 was set to 6 since its >>>> introduction and I guess it was empirically tuned. But since >>>> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA was added the tree >> reassociation >>>> pass seems to be more deliberate in forming FMAs and when that flag is >>>> used it seems to more properly evaluate the FMA vs non-FMA reassociation >>>> widths. >>> >>> Thanks!, evaluating this flag has been on our list for a long while now.. >>> >>>> According to the Neoverse V2 SWOG the core has a throughput of 4 for >>>> most FP operations, so the value 6 is not accurate from first principles.. >>>> Also, the SWOG does state that FMADD operations are pipelined and the >>>> results can be forwarded from FP multiplies to the accumulation operands >>>> of FMADD instructions, which seems to be what >>>> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA expresses. >>> >>> Yeah, We've always overestimated the fp reassociation width I think because >>> of >> the >>> FMA forming pass. >>> >>> My understanding was that the pass has historically been a bit unreliable >>> and >> that >>> the overestimation was to cover up this. But over the years a lot have >>> changed. >>> >>> So I don't think historically it was based on the actual throughput on the >>> core. >>> e.g. Neoverse N1 and Cortex-A57 have a width of 4 set, even though they have >>> a two FP units. >>> >>> That said from first principals I agree with the change, but I'm curious as >>> to >> whether you've >>> also tested it just with AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA. i.e I'm >> wondering >>> if the performance improvements is coming from this flag alone. >> >> It did cross my mind. It’s been a couple weeks since I’ve done the >> benchmarking >> but I remember that neither of the changes alone gave the improvement, only >> combined together. >> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA by itself won’t trigger due to the >> logic fixed by https://gcc.gnu.org/PR116139 (it ICEd before). >> Reducing the fp_reassoc_width didn’t give any difference either. > > That's fair, from reading of the code in get_reassociation_width and how it's > used I agree > that setting it closer to the throughput makes sense. > > So there's no objections from me. I will test the other cores (new and old) > if Richard S agrees. >
Thanks Tamar. I have now pushed this to trunk. We have plenty of time to re-evaluate if we wish. Kyrill > Thanks, > Tamar > >> Thanks, >> Kyrill >> >> >>> >>> Kind Regards, >>> Tamar >>> >>>> >>>> This patch sets the fp_reassoc_width field to 4 and enables >>>> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA for -mcpu=neoverse-v2. >>>> >>>> On SPEC2017 fprate I see the following changes on a Grace system: >>>> 503.bwaves_r 0.16% >>>> 507.cactuBSSN_r -0.32% >>>> 508.namd_r 3.04% >>>> 510.parest_r 0.00% >>>> 511.povray_r 0.78% >>>> 519.lbm_r 0.35% >>>> 521.wrf_r 0.69% >>>> 526.blender_r -0.53% >>>> 527.cam4_r 0.84% >>>> 538.imagick_r 0.00% >>>> 544.nab_r -0.97% >>>> 549.fotonik3d_r -0.45% >>>> 554.roms_r 0.97% >>>> Geomean 0.35% >>>> >>>> with -Ofast -mcpu=grace -flto. >>>> >>>> So slight overall improvement with a meaningful improvement in >>>> 508.namd_r. >>>> >>>> I think other tunings in aarch64 should look into >>>> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA as well, but I'll leave the >>>> benchmarking to someone else. >>>> >>>> Tamar, Richard, does the reasoning above make sense to you? >>>> I know that FMA reassociation is something we’ve gone back and forth on in >> the >>>> backend… >>>> >>>> Thanks, >>>> Kyrill >>>> >>>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com> >>>> >>>> gcc/ChangeLog: >>>> >>>> * config/aarch64/tuning_models/neoversev2.h (fp_reassoc_width): >>>> Set to 4. >>>> (tune_flags): Add AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA.