> On 12 Aug 2024, at 17:24, Tamar Christina <tamar.christ...@arm.com> wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
>> -----Original Message-----
>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Sent: Monday, August 12, 2024 3:54 PM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Richard Sandiford
>> <richard.sandif...@arm.com>
>> Subject: Re: [PATCH][RFC] aarch64: Reduce FP reassociation width for Neoverse
>> V2 and set AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA
>> 
>> Hi Tamar,
>> 
>>> On 12 Aug 2024, at 16:48, Tamar Christina <tamar.christ...@arm.com> wrote:
>>> 
>>> External email: Use caution opening links or attachments
>>> 
>>> 
>>> Hi Kyrill,
>>> 
>>>> -----Original Message-----
>>>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>>>> Sent: Monday, August 12, 2024 3:07 PM
>>>> To: GCC Patches <gcc-patches@gcc.gnu.org>
>>>> Cc: Tamar Christina <tamar.christ...@arm.com>; Richard Sandiford
>>>> <richard.sandif...@arm.com>
>>>> Subject: [PATCH][RFC] aarch64: Reduce FP reassociation width for Neoverse 
>>>> V2
>>>> and set AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA
>>>> 
>>>> Hi all,
>>>> 
>>>> The fp reassociation width for Neoverse V2 was set to 6 since its
>>>> introduction and I guess it was empirically tuned.  But since
>>>> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA was added the tree
>> reassociation
>>>> pass seems to be more deliberate in forming FMAs and when that flag is
>>>> used it seems to more properly evaluate the FMA vs non-FMA reassociation
>>>> widths.
>>> 
>>> Thanks!, evaluating this flag has been on our list for a long while now..
>>> 
>>>> According to the Neoverse V2 SWOG the core has a throughput of 4 for
>>>> most FP operations, so the value 6 is not accurate from first principles..
>>>> Also, the SWOG does state that FMADD operations are pipelined and the
>>>> results can be forwarded from FP multiplies to the accumulation operands
>>>> of FMADD instructions, which seems to be what
>>>> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA expresses.
>>> 
>>> Yeah, We've always overestimated the fp reassociation width I think because 
>>> of
>> the
>>> FMA forming pass.
>>> 
>>> My understanding was that the pass has historically been a bit unreliable 
>>> and
>> that
>>> the overestimation was to cover up this.  But over the years a lot have 
>>> changed.
>>> 
>>> So I don't think historically it was based on the actual throughput on the 
>>> core.
>>> e.g. Neoverse N1 and Cortex-A57 have a width of 4 set, even though they have
>>> a two FP units.
>>> 
>>> That said from first principals I agree with the change, but I'm curious as 
>>> to
>> whether you've
>>> also tested it just with AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA.  i.e I'm
>> wondering
>>> if the performance improvements is coming from this flag alone.
>> 
>> It did cross my mind. It’s been a couple weeks since I’ve done the 
>> benchmarking
>> but I remember that neither of the changes alone gave the improvement, only
>> combined together.
>> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA by itself won’t trigger due to the
>> logic fixed by https://gcc.gnu.org/PR116139 (it ICEd before).
>> Reducing the fp_reassoc_width didn’t give any difference either.
> 
> That's fair, from reading of the code in get_reassociation_width and how it's 
> used I agree
> that setting it closer to the throughput makes sense.
> 
> So there's no objections from me.  I will test the other cores (new and old) 
> if Richard S agrees.
> 

Thanks Tamar. I have now pushed this to trunk.
We have plenty of time to re-evaluate if we wish.
Kyrill


> Thanks,
> Tamar
> 
>> Thanks,
>> Kyrill
>> 
>> 
>>> 
>>> Kind Regards,
>>> Tamar
>>> 
>>>> 
>>>> This patch sets the fp_reassoc_width field to 4 and enables
>>>> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA for -mcpu=neoverse-v2.
>>>> 
>>>> On SPEC2017 fprate I see the following changes on a Grace system:
>>>> 503.bwaves_r  0.16%
>>>> 507.cactuBSSN_r       -0.32%
>>>> 508.namd_r    3.04%
>>>> 510.parest_r  0.00%
>>>> 511.povray_r  0.78%
>>>> 519.lbm_r     0.35%
>>>> 521.wrf_r     0.69%
>>>> 526.blender_r -0.53%
>>>> 527.cam4_r    0.84%
>>>> 538.imagick_r 0.00%
>>>> 544.nab_r     -0.97%
>>>> 549.fotonik3d_r       -0.45%
>>>> 554.roms_r    0.97%
>>>> Geomean               0.35%
>>>> 
>>>> with -Ofast -mcpu=grace -flto.
>>>> 
>>>> So slight overall improvement with a meaningful improvement in
>>>> 508.namd_r.
>>>> 
>>>> I think other tunings in aarch64 should look into
>>>> AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA as well, but I'll leave the
>>>> benchmarking to someone else.
>>>> 
>>>> Tamar, Richard, does the reasoning above make sense to you?
>>>> I know that FMA reassociation is something we’ve gone back and forth on in
>> the
>>>> backend…
>>>> 
>>>> Thanks,
>>>> Kyrill
>>>> 
>>>> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>>>> 
>>>> gcc/ChangeLog:
>>>> 
>>>>     * config/aarch64/tuning_models/neoversev2.h (fp_reassoc_width):
>>>>     Set to 4.
>>>>     (tune_flags): Add AARCH64_EXTRA_TUNE_FULLY_PIPELINED_FMA.


Reply via email to