Hi Tamar,

> On 12 Aug 2024, at 16:48, Tamar Christina <tamar.christ...@arm.com> wrote:
>
> External email: Use caution opening links or attachments
>
>
> Hi Kyrill,
>
>> -----Original Message-----
>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Sent: Monday, August 12, 2024 3:07 PM
>> To: GCC Patches <gcc-patches@gcc.gnu.org>
>> Cc: Tamar Christina <tamar.christ...@arm.com>; Richard Sandiford
>> <richard.sandif...@arm.com>
>> Subject: [PATCH][RFC] aarch64: Reduce FP reassociation width for Neoverse V2
>> and set AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA
>>
>> Hi all,
>>
>> The fp reassociation width for Neoverse V2 has been set to 6 since its
>> introduction, and I guess it was empirically tuned.  But since
>> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA was added, the tree reassociation
>> pass seems to be more deliberate in forming FMAs, and when that flag is
>> used it seems to evaluate the FMA vs non-FMA reassociation widths more
>> properly.
>
> Thanks!  Evaluating this flag has been on our list for a long while now.
>
>> According to the Neoverse V2 SWOG the core has a throughput of 4 for
>> most FP operations, so the value 6 is not accurate from first principles.
>> Also, the SWOG does state that FMADD operations are pipelined and that
>> results can be forwarded from FP multiplies to the accumulation operands
>> of FMADD instructions, which seems to be what
>> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA expresses.
>
> Yeah, we've always overestimated the fp reassociation width, I think
> because of the FMA-forming pass.
>
> My understanding was that the pass has historically been a bit unreliable
> and that the overestimation was there to cover for this.  But over the
> years a lot has changed.
>
> So I don't think it was historically based on the actual throughput of the
> core; e.g. Neoverse N1 and Cortex-A57 have a width of 4 set, even though
> they have two FP units.
>
> That said, from first principles I agree with the change, but I'm curious
> whether you've also tested it with just
> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA, i.e. I'm wondering whether the
> performance improvement comes from this flag alone.
It did cross my mind.  It’s been a couple of weeks since I did the
benchmarking, but I remember that neither of the changes alone gave the
improvement, only the two combined.  AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA
by itself won’t trigger due to the logic fixed by
https://gcc.gnu.org/PR116139 (it ICEd before).  Reducing fp_reassoc_width
on its own didn’t make a difference either.

Thanks,
Kyrill

>
> Kind Regards,
> Tamar
>
>>
>> This patch sets the fp_reassoc_width field to 4 and enables
>> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA for -mcpu=neoverse-v2.
>>
>> On SPEC2017 fprate I see the following changes on a Grace system:
>>   503.bwaves_r     0.16%
>>   507.cactuBSSN_r -0.32%
>>   508.namd_r       3.04%
>>   510.parest_r     0.00%
>>   511.povray_r     0.78%
>>   519.lbm_r        0.35%
>>   521.wrf_r        0.69%
>>   526.blender_r   -0.53%
>>   527.cam4_r       0.84%
>>   538.imagick_r    0.00%
>>   544.nab_r       -0.97%
>>   549.fotonik3d_r -0.45%
>>   554.roms_r       0.97%
>>   Geomean          0.35%
>>
>> with -Ofast -mcpu=grace -flto.
>>
>> So a slight overall improvement, with a meaningful improvement in
>> 508.namd_r.
>>
>> I think other tunings in aarch64 should look into
>> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA as well, but I'll leave the
>> benchmarking to someone else.
>>
>> Tamar, Richard, does the reasoning above make sense to you?
>> I know that FMA reassociation is something we’ve gone back and forth on
>> in the backend…
>>
>> Thanks,
>> Kyrill
>>
>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>
>> gcc/ChangeLog:
>>
>>      * config/aarch64/tuning_models/neoversev2.h (fp_reassoc_width):
>>      Set to 4.
>>      (tune_flags): Add AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA.
>
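For context, here is a rough sketch of the kind of edit the quoted ChangeLog
describes.  It only shows the two fields named in the ChangeLog; the other
fields of the tuning structure in neoversev2.h and its existing tune_flags
set are elided (the "<existing flags>" placeholder is not real code), so
treat this as illustrative rather than as the actual patch:

  --- a/gcc/config/aarch64/tuning_models/neoversev2.h
  +++ b/gcc/config/aarch64/tuning_models/neoversev2.h
     ...
  -  6,    /* fp_reassoc_width.  */
  +  4,    /* fp_reassoc_width.  */
     ...
  -  (<existing flags>),       /* tune_flags.  */
  +  (<existing flags>
  +   | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA),   /* tune_flags.  */
     ...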
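And a minimal, self-contained C example of the kind of floating-point
reduction these two knobs act on (the file and function names are made up
for illustration).  With -Ofast the tree reassociation pass decides how
many parallel accumulators to break the serial chain into, bounded by the
target's fp_reassoc_width, and the FULLY_PIPELINED_FMA tuning influences
whether keeping the fused multiply-add chain is considered preferable:

  /* dotprod.c: a serial FP reduction.  Compile with e.g.
     -Ofast -mcpu=neoverse-v2.  The s + a[i] * b[i] pattern is a
     candidate for fmadd formation, and the reassociation pass may
     split the single dependence chain on s into several parallel
     accumulators, up to the target's fp_reassoc_width.  */

  double
  dot (const double *a, const double *b, int n)
  {
    double s = 0.0;
    for (int i = 0; i < n; i++)
      s += a[i] * b[i];
    return s;
  }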