Hi Tamar,

> On 12 Aug 2024, at 16:48, Tamar Christina <tamar.christ...@arm.com> wrote:
> 
> 
> Hi Kyrill,
> 
>> -----Original Message-----
>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Sent: Monday, August 12, 2024 3:07 PM
>> To: GCC Patches <gcc-patches@gcc.gnu.org>
>> Cc: Tamar Christina <tamar.christ...@arm.com>; Richard Sandiford
>> <richard.sandif...@arm.com>
>> Subject: [PATCH][RFC] aarch64: Reduce FP reassociation width for Neoverse V2
>> and set AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA
>> 
>> Hi all,
>> 
>> The fp reassociation width for Neoverse V2 has been set to 6 since its
>> introduction, and I guess it was tuned empirically.  But since
>> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA was added, the tree reassociation
>> pass seems to be more deliberate in forming FMAs, and when that flag is
>> used it seems to evaluate the FMA vs non-FMA reassociation widths more
>> properly.
> 
> Thanks! Evaluating this flag has been on our list for a long while now.
> 
>> According to the Neoverse V2 SWOG the core has a throughput of 4 for
>> most FP operations, so the value 6 is not accurate from first principles.
>> Also, the SWOG does state that FMADD operations are pipelined and the
>> results can be forwarded from FP multiplies to the accumulation operands
>> of FMADD instructions, which seems to be what
>> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA expresses.
> 
> Yeah, we've always overestimated the fp reassociation width, I think,
> because of the FMA-forming pass.
> 
> My understanding was that the pass has historically been a bit unreliable
> and that the overestimation was there to cover for it.  But a lot has
> changed over the years.
> 
> So I don't think it was historically based on the actual throughput of the
> core.  E.g. Neoverse N1 and Cortex-A57 have a width of 4 set, even though
> they have two FP units.
> 
> That said, from first principles I agree with the change, but I'm curious
> as to whether you've also tested it with just
> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA, i.e. I'm wondering whether the
> performance improvement is coming from that flag alone.

It did cross my mind. It's been a couple of weeks since I did the benchmarking,
but I remember that neither change alone gave the improvement, only the two
combined.
AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA by itself won't trigger due to the logic
fixed by https://gcc.gnu.org/PR116139 (it ICEd before).
Reducing fp_reassoc_width on its own didn't make a difference either.
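
For illustration, here is roughly what fp_reassoc_width controls, as a
hand-written C sketch (not code from the patch): with a width of 4 the
reassoc pass can split a serial FP reduction into four independent
accumulator chains.

```c
/* Serial reduction: a single FP dependency chain; each add must wait
   for the previous one, so the loop is latency-bound.  */
double
sum_serial (const double *a, int n)
{
  double s = 0.0;
  for (int i = 0; i < n; i++)
    s += a[i];
  return s;
}

/* Roughly what a reassociation width of 4 lets the compiler produce:
   four independent accumulators whose adds can issue in parallel on a
   core with four FP pipes.  This rewrite is only valid under
   -ffast-math-style reassociation, since FP addition is not
   associative.  */
double
sum_reassoc4 (const double *a, int n)
{
  double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
  int i;
  for (i = 0; i + 3 < n; i += 4)
    {
      s0 += a[i];
      s1 += a[i + 1];
      s2 += a[i + 2];
      s3 += a[i + 3];
    }
  for (; i < n; i++)
    s0 += a[i];
  return (s0 + s1) + (s2 + s3);
}
```

On a core that can sustain four FP operations per cycle the second form can
keep all pipes busy, which is why the width should track the core's actual
FP throughput.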
Thanks,
Kyrill


> 
> Kind Regards,
> Tamar
> 
>> 
>> This patch sets the fp_reassoc_width field to 4 and enables
>> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA for -mcpu=neoverse-v2.
>> 
>> On SPEC2017 fprate I see the following changes on a Grace system:
>> 503.bwaves_r     0.16%
>> 507.cactuBSSN_r -0.32%
>> 508.namd_r       3.04%
>> 510.parest_r     0.00%
>> 511.povray_r     0.78%
>> 519.lbm_r        0.35%
>> 521.wrf_r        0.69%
>> 526.blender_r   -0.53%
>> 527.cam4_r       0.84%
>> 538.imagick_r    0.00%
>> 544.nab_r       -0.97%
>> 549.fotonik3d_r -0.45%
>> 554.roms_r       0.97%
>> Geomean          0.35%
>> 
>> with -Ofast -mcpu=grace -flto.
>> 
>> So a slight overall improvement, with a meaningful gain in
>> 508.namd_r.
>> 
>> I think other tunings in aarch64 should look into
>> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA as well, but I'll leave the
>> benchmarking to someone else.
>> 
>> Tamar, Richard, does the reasoning above make sense to you?
>> I know that FMA reassociation is something we’ve gone back and forth on
>> in the backend…
>> 
>> Thanks,
>> Kyrill
>> 
>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>> 
>> gcc/ChangeLog:
>> 
>>      * config/aarch64/tuning_models/neoversev2.h (fp_reassoc_width):
>>      Set to 4.
>>      (tune_flags): Add AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA.
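
For reference, the change boils down to two values in the tuning struct,
roughly as below (a hypothetical excerpt with the other fields of
neoversev2_tunings elided; the actual patch context may differ):

```c
/* Hypothetical excerpt of gcc/config/aarch64/tuning_models/neoversev2.h;
   only the two values touched by the patch are shown.  */
static const struct tune_params neoversev2_tunings =
{
  /* ...  */
  4, /* fp_reassoc_width.  */
  /* ...  */
  (AARCH64_EXTRA_TUNE_NONE
   | AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA), /* tune_flags.  */
  /* ...  */
};
```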
> 
