Hi Kyrill, > -----Original Message----- > From: Kyrylo Tkachov <ktkac...@nvidia.com> > Sent: Monday, August 12, 2024 3:07 PM > To: GCC Patches <gcc-patches@gcc.gnu.org> > Cc: Tamar Christina <tamar.christ...@arm.com>; Richard Sandiford > <richard.sandif...@arm.com> > Subject: [PATCH][RFC] aarch64: Reduce FP reassociation width for Neoverse V2 > and set AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA > > Hi all, > > The fp reassociation width for Neoverse V2 was set to 6 since its > introduction and I guess it was empirically tuned. But since > AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA was added the tree reassociation > pass seems to be more deliberate in forming FMAs and when that flag is > used it seems to more properly evaluate the FMA vs non-FMA reassociation > widths.
Thanks!, evaluating this flag has been on our list for a long while now.. > According to the Neoverse V2 SWOG the core has a throughput of 4 for > most FP operations, so the value 6 is not accurate from first principles.. > Also, the SWOG does state that FMADD operations are pipelined and the > results can be forwarded from FP multiplies to the accumulation operands > of FMADD instructions, which seems to be what > AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA expresses. Yeah, We've always overestimated the fp reassociation width I think because of the FMA forming pass. My understanding was that the pass has historically been a bit unreliable and that the overestimation was to cover up this. But over the years a lot have changed. So I don't think historically it was based on the actual throughput on the core. e.g. Neoverse N1 and Cortex-A57 have a width of 4 set, even though they have a two FP units. That said from first principals I agree with the change, but I'm curious as to whether you've also tested it just with AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA. i.e I'm wondering if the performance improvements is coming from this flag alone. Kind Regards, Tamar > > This patch sets the fp_reassoc_width field to 4 and enables > AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA for -mcpu=neoverse-v2. > > On SPEC2017 fprate I see the following changes on a Grace system: > 503.bwaves_r 0.16% > 507.cactuBSSN_r -0.32% > 508.namd_r 3.04% > 510.parest_r 0.00% > 511.povray_r 0.78% > 519.lbm_r 0.35% > 521.wrf_r 0.69% > 526.blender_r -0.53% > 527.cam4_r 0.84% > 538.imagick_r 0.00% > 544.nab_r -0.97% > 549.fotonik3d_r -0.45% > 554.roms_r 0.97% > Geomean 0.35% > > with -Ofast -mcpu=grace -flto. > > So slight overall improvement with a meaningful improvement in > 508.namd_r. > > I think other tunings in aarch64 should look into > AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA as well, but I'll leave the > benchmarking to someone else. > > Tamar, Richard, does the reasoning above make sense to you? > I know that FMA reassociation is something we’ve gone back and forth on in the > backend… > > Thanks, > Kyrill > > Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com> > > gcc/ChangeLog: > > * config/aarch64/tuning_models/neoversev2.h (fp_reassoc_width): > Set to 4. > (tune_flags): Add AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA.