Hi Tamar,

Something I noticed when looking at the various tuning files…

> On 26 Jul 2024, at 11:20, Tamar Christina <tamar.christ...@arm.com> wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
> Hi All,
> 
> This adds a cost model and core definition for Neoverse V3.
> 
> It also makes Cortex-X4 use the Neoverse V3 cost model.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>        * config/aarch64/aarch64-cores.def (cortex-x4): Update.
>        (neoverse-v3): New.
>        * config/aarch64/aarch64-tune.md: Regenerate.
>        * config/aarch64/tuning_models/neoversev3.h: New file.
>        * config/aarch64/aarch64.cc: Use it.
>        * doc/invoke.texi: Document it.
> 
> ---
> diff --git a/gcc/config/aarch64/aarch64-cores.def 
> b/gcc/config/aarch64/aarch64-cores.def
> index 
> 34307fe0c1721dda67adab768dd22a5649687f6e..96c74657a1991acfe86d7c61af4ccce7415fabca
>  100644
> --- a/gcc/config/aarch64/aarch64-cores.def
> +++ b/gcc/config/aarch64/aarch64-cores.def
> @@ -188,13 +188,14 @@ AARCH64_CORE("cortex-x2",  cortexx2, cortexa57, V9A,  
> (SVE2_BITPERM, MEMTAG, I8M
> 
> AARCH64_CORE("cortex-x3",  cortexx3, cortexa57, V9A,  (SVE2_BITPERM, MEMTAG, 
> I8MM, BF16), neoversev2, 0x41, 0xd4e, -1)
> 
> -AARCH64_CORE("cortex-x4",  cortexx4, cortexa57, V9_2A,  (SVE2_BITPERM, 
> MEMTAG, PROFILE), neoversen2, 0x41, 0xd81, -1)
> +AARCH64_CORE("cortex-x4",  cortexx4, cortexa57, V9_2A,  (SVE2_BITPERM, 
> MEMTAG, PROFILE), neoversev3, 0x41, 0xd81, -1)
> 
> AARCH64_CORE("neoverse-n2", neoversen2, cortexa57, V9A, (I8MM, BF16, 
> SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x41, 0xd49, -1)
> AARCH64_CORE("cobalt-100",   cobalt100, cortexa57, V9A, (I8MM, BF16, 
> SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversen2, 0x6d, 0xd49, -1)
> 
> AARCH64_CORE("neoverse-v2", neoversev2, cortexa57, V9A, (I8MM, BF16, 
> SVE2_BITPERM, RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
> AARCH64_CORE("grace", grace, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, 
> SVE2_AES, SVE2_SHA3, SVE2_SM4, PROFILE), neoversev2, 0x41, 0xd4f, -1)
> +AARCH64_CORE("neoverse-v3", neoversev3, cortexa57, V9_2A, (SVE2_BITPERM, 
> RNG, LS64, MEMTAG, PROFILE), neoversev3, 0x41, 0xd84, -1)
> 
> AARCH64_CORE("demeter", demeter, cortexa57, V9A, (I8MM, BF16, SVE2_BITPERM, 
> RNG, MEMTAG, PROFILE), neoversev2, 0x41, 0xd4f, -1)
> 
> diff --git a/gcc/config/aarch64/aarch64-tune.md 
> b/gcc/config/aarch64/aarch64-tune.md
> index 
> 719fd3dc62a5860aad3aa92785413892e46f8816..0c3339b53e425ac36387eb63a0005a25c0c064e7
>  100644
> --- a/gcc/config/aarch64/aarch64-tune.md
> +++ b/gcc/config/aarch64/aarch64-tune.md
> @@ -1,5 +1,5 @@
> ;; -*- buffer-read-only: t -*-
> ;; Generated automatically by gentune.sh from aarch64-cores.def
> (define_attr "tune"
> -       
> "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversev2,grace,demeter,generic,generic_armv8_a,generic_armv9_a"
> +       
> "cortexa34,cortexa35,cortexa53,cortexa57,cortexa72,cortexa73,thunderx,thunderxt88,thunderxt88p1,octeontx,octeontxt81,octeontxt83,thunderxt81,thunderxt83,ampere1,ampere1a,ampere1b,emag,xgene1,falkor,qdf24xx,exynosm1,phecda,thunderx2t99p1,vulcan,thunderx2t99,cortexa55,cortexa75,cortexa76,cortexa76ae,cortexa77,cortexa78,cortexa78ae,cortexa78c,cortexa65,cortexa65ae,cortexx1,cortexx1c,neoversen1,ares,neoversee1,octeontx2,octeontx2t98,octeontx2t96,octeontx2t93,octeontx2f95,octeontx2f95n,octeontx2f95mm,a64fx,tsv110,thunderx3t110,neoversev1,zeus,neoverse512tvb,saphira,oryon1,cortexa57cortexa53,cortexa72cortexa53,cortexa73cortexa35,cortexa73cortexa53,cortexa75cortexa55,cortexa76cortexa55,cortexr82,cortexa510,cortexa520,cortexa710,cortexa715,cortexa720,cortexx2,cortexx3,cortexx4,neoversen2,cobalt100,neoversev2,grace,neoversev3,demeter,generic,generic_armv8_a,generic_armv9_a"
>        (const (symbol_ref "((enum attr_tune) aarch64_tune)")))
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 
> 89eb66348f772a7e94f1acde29cd4badfd51fa3d..569d4a3d16fb9846b89ebbc895cb169a6007a24a
>  100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -413,6 +413,7 @@ static const struct aarch64_flag_desc 
> aarch64_tuning_flags[] =
> #include "tuning_models/neoverse512tvb.h"
> #include "tuning_models/neoversen2.h"
> #include "tuning_models/neoversev2.h"
> +#include "tuning_models/neoversev3.h"
> #include "tuning_models/a64fx.h"
> 
> /* Support for fine-grained override of the tuning structures.  */
> diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h 
> b/gcc/config/aarch64/tuning_models/neoversev3.h
> new file mode 100644
> index 
> 0000000000000000000000000000000000000000..3daa3d2365c817d03c6c0d5e66fe832620d8fb2c
> --- /dev/null
> +++ b/gcc/config/aarch64/tuning_models/neoversev3.h
> @@ -0,0 +1,246 @@
> +/* Tuning model description for AArch64 architecture.
> +   Copyright (C) 2009-2024 Free Software Foundation, Inc.
> +
> +   This file is part of GCC.
> +
> +   GCC is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3, or (at your option)
> +   any later version.
> +
> +   GCC is distributed in the hope that it will be useful, but
> +   WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with GCC; see the file COPYING3.  If not see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef GCC_AARCH64_H_NEOVERSEV3
> +#define GCC_AARCH64_H_NEOVERSEV3
> +
> +#include "generic.h"
> +
> +static const struct cpu_addrcost_table neoversev3_addrcost_table =
> +{
> +    {
> +      1, /* hi  */
> +      0, /* si  */
> +      0, /* di  */
> +      1, /* ti  */
> +    },
> +  0, /* pre_modify  */
> +  0, /* post_modify  */
> +  2, /* post_modify_ld3_st3  */
> +  2, /* post_modify_ld4_st4  */
> +  0, /* register_offset  */
> +  0, /* register_sextend  */
> +  0, /* register_zextend  */
> +  0 /* imm_offset  */
> +};
> +
> +static const struct cpu_regmove_cost neoversev3_regmove_cost =
> +{
> +  3, /* GP2GP  */
> +  /* Spilling to int<->fp instead of memory is recommended so set
> +     realistic costs compared to memmov_cost.  */
> +  5, /* GP2FP  */
> +  4, /* FP2GP  */
> +  4 /* FP2FP  */
> +};
> +
> +static const advsimd_vec_cost neoversev3_advsimd_vector_cost =
> +{
> +  2, /* int_stmt_cost  */
> +  2, /* fp_stmt_cost  */
> +  2, /* ld2_st2_permute_cost */
> +  2, /* ld3_st3_permute_cost  */
> +  3, /* ld4_st4_permute_cost  */
> +  2, /* permute_cost  */
> +  4, /* reduc_i8_cost  */
> +  4, /* reduc_i16_cost  */
> +  2, /* reduc_i32_cost  */
> +  2, /* reduc_i64_cost  */
> +  6, /* reduc_f16_cost  */
> +  4, /* reduc_f32_cost  */
> +  2, /* reduc_f64_cost  */
> +  2, /* store_elt_extra_cost  */
> +  /* This value is just inherited from the Cortex-A57 table.  */
> +  8, /* vec_to_scalar_cost  */
> +  /* This depends very much on what the scalar value is and
> +     where it comes from.  E.g. some constants take two dependent
> +     instructions or a load, while others might be moved from a GPR.
> +     4 seems to be a reasonable compromise in practice.  */
> +  4, /* scalar_to_vec_cost  */
> +  4, /* align_load_cost  */
> +  4, /* unalign_load_cost  */
> +  /* Although stores have a latency of 2 and compete for the
> +     vector pipes, in practice it's better not to model that.  */
> +  1, /* unalign_store_cost  */
> +  1  /* store_cost  */
> +};
> +
> +static const sve_vec_cost neoversev3_sve_vector_cost =
> +{
> +  {
> +    2, /* int_stmt_cost  */
> +    2, /* fp_stmt_cost  */
> +    2, /* ld2_st2_permute_cost  */
> +    3, /* ld3_st3_permute_cost  */
> +    3, /* ld4_st4_permute_cost  */
> +    2, /* permute_cost  */
> +    /* Theoretically, a reduction involving 15 scalar ADDs could
> +       complete in ~4 cycles and would have a cost of 15.  [SU]ADDV
> +       completes in 9 cycles, so give it a cost of 15 + 5.  */
> +    20, /* reduc_i8_cost  */
> +    /* Likewise for 7 scalar ADDs (~3 cycles) vs. 8: 7 + 5.  */
> +    12, /* reduc_i16_cost  */
> +    /* Likewise for 3 scalar ADDs (~2 cycles) vs. 6: 3 + 4.  */
> +    7, /* reduc_i32_cost  */
> +    /* Likewise for 1 scalar ADDs (~1 cycles) vs. 2: 1 + 1.  */
> +    2, /* reduc_i64_cost  */
> +    /* Theoretically, a reduction involving 7 scalar FADDs could
> +       complete in ~6 cycles and would have a cost of  7.  FADDV
> +       completes in 8 cycles, so give it a cost of 7 + 2.  */
> +    9, /* reduc_f16_cost  */
> +    /* Likewise for 3 scalar FADDs (~4 cycles) vs. 6: 3 + 2.  */
> +    5, /* reduc_f32_cost  */
> +    /* Likewise for 1 scalar FADD (~2 cycles) vs. 4: 1 + 2.  */
> +    3, /* reduc_f64_cost  */
> +    2, /* store_elt_extra_cost  */
> +    /* This value is just inherited from the Cortex-A57 table.  */
> +    8, /* vec_to_scalar_cost  */
> +    /* See the comment above the Advanced SIMD versions.  */
> +    4, /* scalar_to_vec_cost  */
> +    4, /* align_load_cost  */
> +    4, /* unalign_load_cost  */
> +    /* Although stores have a latency of 2 and compete for the
> +       vector pipes, in practice it's better not to model that.  */
> +    1, /* unalign_store_cost  */
> +    1  /* store_cost  */
> +  },
> +  3, /* clast_cost  */
> +  10, /* fadda_f16_cost  */
> +  6, /* fadda_f32_cost  */
> +  4, /* fadda_f64_cost  */
> +  /* A strided Advanced SIMD x64 load would take two parallel FP loads
> +     (8 cycles) plus an insertion (2 cycles).  Assume a 64-bit SVE gather
> +     is 1 cycle more.  The Advanced SIMD version is costed as 2 scalar loads
> +     (cost 8) and a vec_construct (cost 4).  Add a full vector operation
> +     (cost 2) to that, to avoid the difference being lost in rounding.
> +
> +     There is no easy comparison between a strided Advanced SIMD x32 load
> +     and an SVE 32-bit gather, but cost an SVE 32-bit gather as 1 vector
> +     operation more than a 64-bit gather.  */
> +  14, /* gather_load_x32_cost  */
> +  12, /* gather_load_x64_cost  */
> +  1 /* scatter_store_elt_cost  */
> +};
> +
> +static const aarch64_scalar_vec_issue_info neoversev3_scalar_issue_info =
> +{
> +  3, /* loads_stores_per_cycle  */
> +  2, /* stores_per_cycle  */
> +  8, /* general_ops_per_cycle  */
> +  0, /* fp_simd_load_general_ops  */
> +  1 /* fp_simd_store_general_ops  */
> +};
> +
> +static const aarch64_advsimd_vec_issue_info neoversev3_advsimd_issue_info =
> +{
> +  {
> +    0, /* loads_stores_per_cycle  */
> +    1, /* stores_per_cycle  */
> +    4, /* general_ops_per_cycle  */
> +    0, /* fp_simd_load_general_ops  */
> +    1 /* fp_simd_store_general_ops  */
> +  },
> +  2, /* ld2_st2_general_ops  */
> +  2, /* ld3_st3_general_ops  */
> +  3 /* ld4_st4_general_ops  */
> +};
> +
> +static const aarch64_sve_vec_issue_info neoversev3_sve_issue_info =
> +{
> +  {
> +    {
> +      0, /* loads_stores_per_cycle  */
> +      1, /* stores_per_cycle  */
> +      4, /* general_ops_per_cycle  */
> +      0, /* fp_simd_load_general_ops  */
> +      1 /* fp_simd_store_general_ops  */
> +    },
> +    2, /* ld2_st2_general_ops  */
> +    2, /* ld3_st3_general_ops  */
> +    3 /* ld4_st4_general_ops  */
> +  },
> +  2, /* pred_ops_per_cycle  */
> +  1, /* while_pred_ops  */
> +  0, /* int_cmp_pred_ops  */
> +  0, /* fp_cmp_pred_ops  */
> +  1, /* gather_scatter_pair_general_ops  */
> +  1 /* gather_scatter_pair_pred_ops  */
> +};
> +
> +static const aarch64_vec_issue_info neoversev3_vec_issue_info =
> +{
> +  &neoversev3_scalar_issue_info,
> +  &neoversev3_advsimd_issue_info,
> +  &neoversev3_sve_issue_info
> +};
> +
> +/* Neoversev3 costs for vector insn classes.  */
> +static const struct cpu_vector_cost neoversev3_vector_cost =
> +{
> +  1, /* scalar_int_stmt_cost  */
> +  2, /* scalar_fp_stmt_cost  */
> +  4, /* scalar_load_cost  */
> +  1, /* scalar_store_cost  */
> +  1, /* cond_taken_branch_cost  */
> +  1, /* cond_not_taken_branch_cost  */
> +  &neoversev3_advsimd_vector_cost, /* advsimd  */
> +  &neoversev3_sve_vector_cost, /* sve  */
> +  &neoversev3_vec_issue_info /* issue_info  */
> +};
> +
> +static const struct tune_params neoversev3_tunings =
> +{
> +  &cortexa76_extra_costs,
> +  &neoversev3_addrcost_table,
> +  &neoversev3_regmove_cost,
> +  &neoversev3_vector_cost,
> +  &generic_branch_cost,
> +  &generic_approx_modes,
> +  SVE_128, /* sve_width  */
> +  { 4, /* load_int.  */
> +    2, /* store_int.  */
> +    6, /* load_fp.  */
> +    1, /* store_fp.  */
> +    6, /* load_pred.  */
> +    2 /* store_pred.  */
> +  }, /* memmov_cost.  */
> +  10, /* issue_rate  */
> +  (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_CMP_BRANCH), /* fusible_ops  */

According to the Neoverse V3 SWOG, the core also supports fusion of CMP+CSEL
and CMP+CSET.
We enabled these for Neoverse V2 with AARCH64_FUSE_CMP_CSEL and
AARCH64_FUSE_CMP_CSET.
Jennifer added the logic for them presumably after you had written this patch
internally, so the flags wouldn't have existed earlier in the year.
You may want to consider adding them to this tuning, as well as to the V3AE
one (see the sketch below).
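
A minimal, untested sketch of what the fusible_ops entry in neoversev3.h could
look like, assuming the same flag spellings used in the Neoverse V2 tuning:

  (AARCH64_FUSE_AES_AESMC | AARCH64_FUSE_CMP_BRANCH
   | AARCH64_FUSE_CMP_CSEL | AARCH64_FUSE_CMP_CSET), /* fusible_ops  */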

Thanks,
Kyrill


> +  "32:16",     /* function_align.  */
> +  "4",         /* jump_align.  */
> +  "32:16",     /* loop_align.  */
> +  4,   /* int_reassoc_width.  */
> +  6,   /* fp_reassoc_width.  */
> +  4,   /* fma_reassoc_width.  */
> +  3,   /* vec_reassoc_width.  */
> +  2,   /* min_div_recip_mul_sf.  */
> +  2,   /* min_div_recip_mul_df.  */
> +  0,   /* max_case_values.  */
> +  tune_params::AUTOPREFETCHER_WEAK,    /* autoprefetcher_model.  */
> +  (AARCH64_EXTRA_TUNE_CHEAP_SHIFT_EXTEND
> +   | AARCH64_EXTRA_TUNE_CSE_SVE_VL_CONSTANTS
> +   | AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
> +   | AARCH64_EXTRA_TUNE_MATCHED_VECTOR_THROUGHPUT
> +   | AARCH64_EXTRA_TUNE_AVOID_PRED_RMW),       /* tune_flags.  */
> +  &generic_prefetch_tune,
> +  AARCH64_LDP_STP_POLICY_ALWAYS,   /* ldp_policy_model.  */
> +  AARCH64_LDP_STP_POLICY_ALWAYS           /* stp_policy_model.  */
> +};
> +
> +#endif /* GCC_AARCH64_H_NEOVERSEV3.  */
> \ No newline at end of file
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 
> 403ea9da1abd5a012d0b18849852604b10689682..ffcf4f146d92d410c6b515b3b80f07bdec1d2b55
>  100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -21524,6 +21524,7 @@ performance of the code.  Permissible values for this 
> option are:
> @samp{oryon-1},
> @samp{neoverse-512tvb}, @samp{neoverse-e1}, @samp{neoverse-n1},
> @samp{neoverse-n2}, @samp{neoverse-v1}, @samp{neoverse-v2}, @samp{grace},
> +@samp{neoverse-v3},
> @samp{qdf24xx}, @samp{saphira}, @samp{phecda}, @samp{xgene1}, @samp{vulcan},
> @samp{octeontx}, @samp{octeontx81},  @samp{octeontx83},
> @samp{octeontx2}, @samp{octeontx2t98}, @samp{octeontx2t96}
> 
> 
> 
> 
> --
> <rb18665.patch>
