Hi Kyrill,

> -----Original Message-----
> From: Kyrylo Tkachov <ktkac...@nvidia.com>
> Sent: Monday, August 12, 2024 3:07 PM
> To: GCC Patches <gcc-patches@gcc.gnu.org>
> Cc: Tamar Christina <tamar.christ...@arm.com>; Richard Sandiford
> <richard.sandif...@arm.com>
> Subject: [PATCH][RFC] aarch64: Reduce FP reassociation width for Neoverse V2
> and set AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA
> 
> Hi all,
> 
> The fp reassociation width for Neoverse V2 was set to 6 since its
> introduction and I guess it was empirically tuned.  But since
> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA was added the tree reassociation
> pass seems to be more deliberate in forming FMAs and when that flag is
> used it seems to more properly evaluate the FMA vs non-FMA reassociation
> widths.

Thanks! Evaluating this flag has been on our list for a long while now.

> According to the Neoverse V2 SWOG the core has a throughput of 4 for
> most FP operations, so the value 6 is not accurate from first principles.
> Also, the SWOG does state that FMADD operations are pipelined and the
> results can be forwarded from FP multiplies to the accumulation operands
> of FMADD instructions, which seems to be what
> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA expresses.

Yeah, we've always overestimated the FP reassociation width, I think because
of the FMA forming pass.

My understanding was that the pass has historically been a bit unreliable and
that the overestimation was there to cover this up.  But over the years a lot
has changed.

So I don't think it was historically based on the actual throughput of the core.
E.g. Neoverse N1 and Cortex-A57 have a width of 4 set, even though they have
two FP units.

That said, from first principles I agree with the change, but I'm curious as to
whether you've also tested it with just AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA,
i.e. I'm wondering if the performance improvement is coming from this flag
alone.
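
To make the width vs. FMA trade-off concrete, here is a rough source-level
sketch.  It is only an illustration: the function names are made up, the
reassoc pass works on GIMPLE rather than on C source, and none of this is
taken from the patch.

```c
/* Hypothetical illustration of reassociation width for an 8-term
   sum of products.  Not code from GCC or from the patch.  */

/* Width 1: a single dependency chain.  Each addition can fuse with a
   multiply into an FMA, but every FMA has to wait for the previous one.  */
double dot8_chain(const double a[8], const double b[8])
{
    double acc = a[0] * b[0];
    for (int i = 1; i < 8; i++)
        acc += a[i] * b[i];   /* depends on the previous acc */
    return acc;
}

/* Width 4: four partial sums with no dependencies between them, so up to
   four FP operations can be in flight at once, followed by a short
   reduction tree.  */
double dot8_width4(const double a[8], const double b[8])
{
    double t0 = a[0] * b[0] + a[1] * b[1];
    double t1 = a[2] * b[2] + a[3] * b[3];
    double t2 = a[4] * b[4] + a[5] * b[5];
    double t3 = a[6] * b[6] + a[7] * b[7];
    return (t0 + t1) + (t2 + t3);
}
```

Both forms compute the same mathematical value, but the rounding can differ,
which is why this kind of rewrite needs -Ofast or -ffast-math.  The wider form
exposes more parallelism, while the chain form keeps every addition paired
with a multiply.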

Kind Regards,
Tamar

> 
> This patch sets the fp_reassoc_width field to 4 and enables
> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA for -mcpu=neoverse-v2.
> 
> On SPEC2017 fprate I see the following changes on a Grace system:
> 503.bwaves_r       0.16%
> 507.cactuBSSN_r   -0.32%
> 508.namd_r         3.04%
> 510.parest_r       0.00%
> 511.povray_r       0.78%
> 519.lbm_r          0.35%
> 521.wrf_r          0.69%
> 526.blender_r     -0.53%
> 527.cam4_r         0.84%
> 538.imagick_r      0.00%
> 544.nab_r         -0.97%
> 549.fotonik3d_r   -0.45%
> 554.roms_r         0.97%
> Geomean            0.35%
> 
> with -Ofast -mcpu=grace -flto.
> 
> So slight overall improvement with a meaningful improvement in
> 508.namd_r.
> 
> I think other tunings in aarch64 should look into
> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA as well, but I'll leave the
> benchmarking to someone else.
> 
> Tamar, Richard, does the reasoning above make sense to you?
> I know that FMA reassociation is something we’ve gone back and forth on in the
> backend…
> 
> Thanks,
> Kyrill
> 
> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
> 
> gcc/ChangeLog:
> 
>       * config/aarch64/tuning_models/neoversev2.h (fp_reassoc_width):
>       Set to 4.
>       (tune_flags): Add AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA.
