> -----Original Message-----
> From: Kyrylo Tkachov <ktkac...@nvidia.com>
> Sent: Monday, August 12, 2024 3:54 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Richard Sandiford
> <richard.sandif...@arm.com>
> Subject: Re: [PATCH][RFC] aarch64: Reduce FP reassociation width for Neoverse
> V2 and set AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA
> 
> Hi Tamar,
> 
> > On 12 Aug 2024, at 16:48, Tamar Christina <tamar.christ...@arm.com> wrote:
> >
> > External email: Use caution opening links or attachments
> >
> >
> > Hi Kyrill,
> >
> >> -----Original Message-----
> >> From: Kyrylo Tkachov <ktkac...@nvidia.com>
> >> Sent: Monday, August 12, 2024 3:07 PM
> >> To: GCC Patches <gcc-patches@gcc.gnu.org>
> >> Cc: Tamar Christina <tamar.christ...@arm.com>; Richard Sandiford
> >> <richard.sandif...@arm.com>
> >> Subject: [PATCH][RFC] aarch64: Reduce FP reassociation width for Neoverse 
> >> V2
> >> and set AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA
> >>
> >> Hi all,
> >>
> >> The fp reassociation width for Neoverse V2 was set to 6 since its
> >> introduction and I guess it was empirically tuned.  But since
> >> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA was added the tree
> reassociation
> >> pass seems to be more deliberate in forming FMAs and when that flag is
> >> used it seems to more properly evaluate the FMA vs non-FMA reassociation
> >> widths.
> >
> > Thanks!, evaluating this flag has been on our list for a long while now..
> >
> >> According to the Neoverse V2 SWOG the core has a throughput of 4 for
> >> most FP operations, so the value 6 is not accurate from first principles..
> >> Also, the SWOG does state that FMADD operations are pipelined and the
> >> results can be forwarded from FP multiplies to the accumulation operands
> >> of FMADD instructions, which seems to be what
> >> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA expresses.
> >
> > Yeah, We've always overestimated the fp reassociation width I think because 
> > of
> the
> > FMA forming pass.
> >
> > My understanding was that the pass has historically been a bit unreliable 
> > and
> that
> > the overestimation was to cover up this.  But over the years a lot have 
> > changed.
> >
> > So I don't think historically it was based on the actual throughput on the 
> > core.
> > e.g. Neoverse N1 and Cortex-A57 have a width of 4 set, even though they have
> > a two FP units.
> >
> > That said from first principals I agree with the change, but I'm curious as 
> > to
> whether you've
> > also tested it just with AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA.  i.e I'm
> wondering
> > if the performance improvements is coming from this flag alone.
> 
> It did cross my mind. It’s been a couple weeks since I’ve done the 
> benchmarking
> but I remember that neither of the changes alone gave the improvement, only
> combined together.
> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA by itself won’t trigger due to the
> logic fixed by https://gcc.gnu.org/PR116139 (it ICEd before).
> Reducing the fp_reassoc_width didn’t give any difference either.

That's fair, from reading of the code in get_reassociation_width and how it's 
used I agree
that setting it closer to the throughput makes sense.

So there's no objections from me.  I will test the other cores (new and old) if 
Richard S agrees.

Thanks,
Tamar

> Thanks,
> Kyrill
> 
> 
> >
> > Kind Regards,
> > Tamar
> >
> >>
> >> This patch sets the fp_reassoc_width field to 4 and enables
> >> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA for -mcpu=neoverse-v2.
> >>
> >> On SPEC2017 fprate I see the following changes on a Grace system:
> >> 503.bwaves_r  0.16%
> >> 507.cactuBSSN_r       -0.32%
> >> 508.namd_r    3.04%
> >> 510.parest_r  0.00%
> >> 511.povray_r  0.78%
> >> 519.lbm_r     0.35%
> >> 521.wrf_r     0.69%
> >> 526.blender_r -0.53%
> >> 527.cam4_r    0.84%
> >> 538.imagick_r 0.00%
> >> 544.nab_r     -0.97%
> >> 549.fotonik3d_r       -0.45%
> >> 554.roms_r    0.97%
> >> Geomean               0.35%
> >>
> >> with -Ofast -mcpu=grace -flto.
> >>
> >> So slight overall improvement with a meaningful improvement in
> >> 508.namd_r.
> >>
> >> I think other tunings in aarch64 should look into
> >> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA as well, but I'll leave the
> >> benchmarking to someone else.
> >>
> >> Tamar, Richard, does the reasoning above make sense to you?
> >> I know that FMA reassociation is something we’ve gone back and forth on in
> the
> >> backend…
> >>
> >> Thanks,
> >> Kyrill
> >>
> >> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
> >>
> >> gcc/ChangeLog:
> >>
> >>      * config/aarch64/tuning_models/neoversev2.h (fp_reassoc_width):
> >>      Set to 4.
> >>      (tune_flags): Add AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA.
> >

Reply via email to