On Thu, Dec 7, 2023, 09:39 Richard Biener <richard.guent...@gmail.com> wrote:
> On Wed, Dec 6, 2023 at 6:15 PM Manos Anagnostakis
> <manos.anagnosta...@vrull.eu> wrote:
> >
> > Hi again,
> >
> > I went and tested the requested changes and found out the following:
> >
> > 1. The pass is currently increasing insn_cnt on a NONJUMP_INSN_P, which
> > is a subset of NONDEBUG_INSN_P.  I think there is no problem with
> > depending on -g with the current version.  Do you see something I don't,
> > or did you mean something else?
>
> It just occurred to me - thanks for double-checking (it wasn't obvious
> to me that NONJUMP_INSN_P doesn't include DEBUG_INSNs ...).
>
> > 2. Not processing all instructions does not let cselib record all the
> > effects they have, so it does not have up-to-date information to find
> > true forwardings at any given time.  I can confirm this, since I am
> > seeing many unexpected changes in the number of handled cases if I do
> > this only for loads/stores.
>
> Ah, OK.  I guess I don't fully get it - it seems you use cselib to
> compare addresses, and while possibly not processing part of the address
> computation might break this, other stmts in between should be
> uninteresting (at least I don't see you handling intermediate
> may-aliasing [small] stores to disable the splitting).
>
> So in the end it's a compile-time trade-off between relying solely on
> cselib or trying to optimize cselib use with DF for address computes?
>
> Richard.
>
I am not really familiar with the DF technique, but here is what we did:
at first we used a combination of alias analysis and cselib, which, while
trying to avoid false positives whose memory address register was changed
in between, rejected some stores from being considered candidates for
forwarding to an ldp.  In the end, using only cselib with the correct
lookups was the way to handle both the register difference and the
intermediate change of the address value.

> >
> > Thanks in advance and please let me know your thoughts on the above.
> >
> > Manos.
> >
> > On Wed, Dec 6, 2023 at 5:10 PM Manos Anagnostakis
> > <manos.anagnosta...@vrull.eu> wrote:
> >>
> >> Hi Richard,
> >>
> >> thanks for the useful comments.
> >>
> >> On Wed, Dec 6, 2023 at 4:32 PM Richard Biener
> >> <richard.guent...@gmail.com> wrote:
> >>>
> >>> On Wed, Dec 6, 2023 at 2:48 PM Manos Anagnostakis
> >>> <manos.anagnosta...@vrull.eu> wrote:
> >>> >
> >>> > This is an RTL pass that detects store forwarding from stores to
> >>> > larger loads (load pairs).
> >>> >
> >>> > This optimization is SPEC2017-driven and was found to be beneficial
> >>> > for some benchmarks, through testing on ampere1/ampere1a machines.
> >>> >
> >>> > For example, it can transform cases like
> >>> >
> >>> >   str  d5, [sp, #320]
> >>> >   fmul d5, d31, d29
> >>> >   ldp  d31, d17, [sp, #312] # Large load from small store
> >>> >
> >>> > to
> >>> >
> >>> >   str d5, [sp, #320]
> >>> >   fmul d5, d31, d29
> >>> >   ldr d31, [sp, #312]
> >>> >   ldr d17, [sp, #320]
> >>> >
> >>> > Currently, the pass is disabled by default on all architectures and
> >>> > enabled by a target-specific option.
> >>> >
> >>> > If deemed beneficial enough for a default, it will be enabled on
> >>> > ampere1/ampere1a, or other architectures as well, without needing to
> >>> > be turned on by this option.
> >>>
> >>> What is aarch64-specific about the pass?
> >>
> >> The pass was designed to target load pairs, which are AArch64-specific,
> >> so it cannot handle generic loads.
> >>>
> >>> I see an increasingly large number of target-specific passes pop up
> >>> (probably with the excuse that we can generalize them if necessary).
> >>> But GCC isn't LLVM and this feels like getting out of hand?
> >>>
> >>> The x86 backend also has its store-forwarding "pass" as part of mdreorg
> >>> in ix86_split_stlf_stall_load.
> >>>
> >>> Richard.
> >>>
> >>> > Bootstrapped and regtested on aarch64-linux.
> >>> >
> >>> > gcc/ChangeLog:
> >>> >
> >>> >         * config.gcc: Add aarch64-store-forwarding.o to extra_objs.
> >>> >         * config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New pass.
> >>> >         * config/aarch64/aarch64-protos.h (make_pass_avoid_store_forwarding): Declare.
> >>> >         * config/aarch64/aarch64.opt (mavoid-store-forwarding): New option.
> >>> >         (aarch64-store-forwarding-threshold): New param.
> >>> >         * config/aarch64/t-aarch64: Add aarch64-store-forwarding.o.
> >>> >         * doc/invoke.texi: Document new option and new param.
> >>> >         * config/aarch64/aarch64-store-forwarding.cc: New file.
> >>> >
> >>> > gcc/testsuite/ChangeLog:
> >>> >
> >>> >         * gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test.
> >>> >         * gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test.
> >>> >         * gcc.target/aarch64/ldp_ssll_overlap.c: New test.
> >>> >
> >>> > Signed-off-by: Manos Anagnostakis <manos.anagnosta...@vrull.eu>
> >>> > Co-Authored-By: Manolis Tsamis <manolis.tsa...@vrull.eu>
> >>> > Co-Authored-By: Philipp Tomsich <philipp.toms...@vrull.eu>
> >>> > ---
> >>> > Changes in v6:
> >>> >         - An obvious change. insn_cnt was incremented only on
> >>> >           stores and not for every insn in the bb. Now restored.
> >>> >
> >>> >  gcc/config.gcc                                |   1 +
> >>> >  gcc/config/aarch64/aarch64-passes.def         |   1 +
> >>> >  gcc/config/aarch64/aarch64-protos.h           |   1 +
> >>> >  .../aarch64/aarch64-store-forwarding.cc       | 318 ++++++++++++++++++
> >>> >  gcc/config/aarch64/aarch64.opt                |   9 +
> >>> >  gcc/config/aarch64/t-aarch64                  |  10 +
> >>> >  gcc/doc/invoke.texi                           |  11 +-
> >>> >  .../aarch64/ldp_ssll_no_overlap_address.c     |  33 ++
> >>> >  .../aarch64/ldp_ssll_no_overlap_offset.c      |  33 ++
> >>> >  .../gcc.target/aarch64/ldp_ssll_overlap.c     |  33 ++
> >>> >  10 files changed, 449 insertions(+), 1 deletion(-)
> >>> >  create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc
> >>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> >>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> >>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> >>> >
> >>> > diff --git a/gcc/config.gcc b/gcc/config.gcc
> >>> > index 6450448f2f0..7c48429eb82 100644
> >>> > --- a/gcc/config.gcc
> >>> > +++ b/gcc/config.gcc
> >>> > @@ -350,6 +350,7 @@ aarch64*-*-*)
> >>> >         cxx_target_objs="aarch64-c.o"
> >>> >         d_target_objs="aarch64-d.o"
> >>> >         extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o aarch64-sve-builtins-sme.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
> >>> > +       extra_objs="${extra_objs} aarch64-store-forwarding.o"
> >>> >         target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc \$(srcdir)/config/aarch64/aarch64-sve-builtins.h \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
> >>> >         target_has_targetm_common=yes
> >>> >         ;;
> >>> > diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
> >>> > index 662a13fd5e6..94ced0aebf6 100644
> >>> > --- a/gcc/config/aarch64/aarch64-passes.def
> >>> > +++ b/gcc/config/aarch64/aarch64-passes.def
> >>> > @@ -24,3 +24,4 @@ INSERT_PASS_BEFORE (pass_late_thread_prologue_and_epilogue, 1, pass_switch_pstat
> >>> >  INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
> >>> >  INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
> >>> >  INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
> >>> > +INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding);
> >>> > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> >>> > index 60ff61f6d54..8f5f2ca4710 100644
> >>> > --- a/gcc/config/aarch64/aarch64-protos.h
> >>> > +++ b/gcc/config/aarch64/aarch64-protos.h
> >>> > @@ -1069,6 +1069,7 @@ rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *);
> >>> >  rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
> >>> >  rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
> >>> >  rtl_opt_pass *make_pass_switch_pstate_sm (gcc::context *ctxt);
> >>> > +rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt);
> >>> >
> >>> >  poly_uint64 aarch64_regmode_natural_size (machine_mode);
> >>> >
> >>> > diff --git a/gcc/config/aarch64/aarch64-store-forwarding.cc b/gcc/config/aarch64/aarch64-store-forwarding.cc
> >>> > new file mode 100644
> >>> > index 00000000000..8a6faefd8c0
> >>> > --- /dev/null
> >>> > +++ b/gcc/config/aarch64/aarch64-store-forwarding.cc
> >>> > @@ -0,0 +1,318 @@
> >>> > +/* Avoid store forwarding optimization pass.
> >>> > +   Copyright (C) 2023 Free Software Foundation, Inc.
> >>> > +   Contributed by VRULL GmbH.
> >>> > +
> >>> > +   This file is part of GCC.
> >>> > +
> >>> > +   GCC is free software; you can redistribute it and/or modify it
> >>> > +   under the terms of the GNU General Public License as published by
> >>> > +   the Free Software Foundation; either version 3, or (at your option)
> >>> > +   any later version.
> >>> > +
> >>> > +   GCC is distributed in the hope that it will be useful, but
> >>> > +   WITHOUT ANY WARRANTY; without even the implied warranty of
> >>> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> >>> > +   General Public License for more details.
> >>> > +
> >>> > +   You should have received a copy of the GNU General Public License
> >>> > +   along with GCC; see the file COPYING3.  If not see
> >>> > +   <http://www.gnu.org/licenses/>.  */
> >>> > +
> >>> > +#define IN_TARGET_CODE 1
> >>> > +
> >>> > +#include "config.h"
> >>> > +#define INCLUDE_LIST
> >>> > +#include "system.h"
> >>> > +#include "coretypes.h"
> >>> > +#include "backend.h"
> >>> > +#include "rtl.h"
> >>> > +#include "alias.h"
> >>> > +#include "rtlanal.h"
> >>> > +#include "tree-pass.h"
> >>> > +#include "cselib.h"
> >>> > +
> >>> > +/* This is an RTL pass that detects store forwarding from stores to larger
> >>> > +   loads (load pairs).  For example, it can transform cases like
> >>> > +
> >>> > +   str  d5, [sp, #320]
> >>> > +   fmul d5, d31, d29
> >>> > +   ldp  d31, d17, [sp, #312] # Large load from small store
> >>> > +
> >>> > +   to
> >>> > +
> >>> > +   str d5, [sp, #320]
> >>> > +   fmul d5, d31, d29
> >>> > +   ldr d31, [sp, #312]
> >>> > +   ldr d17, [sp, #320]
> >>> > +
> >>> > +   Design: The pass follows a straightforward design.  It starts by
> >>> > +   initializing the alias analysis and the cselib.  Both of these are used to
> >>> > +   find stores and larger loads with overlapping addresses, which are
> >>> > +   candidates for store forwarding optimizations.  It then scans on basic
> >>> > +   block level to find stores that forward to larger loads and handles them
> >>> > +   accordingly as described in the above example.  Finally, the alias
> >>> > +   analysis and the cselib library are closed.  */
> >>> > +
> >>> > +typedef struct
> >>> > +{
> >>> > +  rtx_insn *store_insn;
> >>> > +  rtx store_mem_addr;
> >>> > +  unsigned int insn_cnt;
> >>> > +} str_info;
> >>> > +
> >>> > +typedef std::list<str_info> list_store_info;
> >>> > +
> >>> > +/* Statistics counters.  */
> >>> > +static unsigned int stats_store_count = 0;
> >>> > +static unsigned int stats_ldp_count = 0;
> >>> > +static unsigned int stats_ssll_count = 0;
> >>> > +static unsigned int stats_transformed_count = 0;
> >>> > +
> >>> > +/* Default.  */
> >>> > +static rtx dummy;
> >>> > +static bool is_load (rtx expr, rtx &op_1 = dummy);
> >>> > +
> >>> > +/* Return true if SET expression EXPR is a store; otherwise false.  */
> >>> > +
> >>> > +static bool
> >>> > +is_store (rtx expr)
> >>> > +{
> >>> > +  return MEM_P (SET_DEST (expr));
> >>> > +}
> >>> > +
> >>> > +/* Return true if SET expression EXPR is a load; otherwise false.  OP_1 will
> >>> > +   contain the MEM operand of the load.  */
> >>> > +
> >>> > +static bool
> >>> > +is_load (rtx expr, rtx &op_1)
> >>> > +{
> >>> > +  op_1 = SET_SRC (expr);
> >>> > +
> >>> > +  if (GET_CODE (op_1) == ZERO_EXTEND
> >>> > +      || GET_CODE (op_1) == SIGN_EXTEND)
> >>> > +    op_1 = XEXP (op_1, 0);
> >>> > +
> >>> > +  return MEM_P (op_1);
> >>> > +}
> >>> > +
> >>> > +/* Return true if STORE_MEM_ADDR is forwarding to the address of LOAD_MEM;
> >>> > +   otherwise false.  STORE_MEM_MODE is the mode of the MEM rtx containing
> >>> > +   STORE_MEM_ADDR.  */
> >>> > +
> >>> > +static bool
> >>> > +is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode store_mem_mode)
> >>> > +{
> >>> > +  /* Sometimes we do not have the proper value.  */
> >>> > +  if (!CSELIB_VAL_PTR (store_mem_addr))
> >>> > +    return false;
> >>> > +
> >>> > +  gcc_checking_assert (MEM_P (load_mem));
> >>> > +
> >>> > +  return rtx_equal_for_cselib_1 (store_mem_addr,
> >>> > +                                 get_addr (XEXP (load_mem, 0)),
> >>> > +                                 store_mem_mode, 0);
> >>> > +}
> >>> > +
> >>> > +/* Return true if INSN is a load pair, preceded by a store forwarding to it;
> >>> > +   otherwise false.  STORE_EXPRS contains the stores.  */
> >>> > +
> >>> > +static bool
> >>> > +is_small_store_to_large_load (list_store_info store_exprs, rtx_insn *insn)
> >>> > +{
> >>> > +  unsigned int load_count = 0;
> >>> > +  bool forwarding = false;
> >>> > +  rtx expr = PATTERN (insn);
> >>> > +
> >>> > +  if (GET_CODE (expr) != PARALLEL
> >>> > +      || XVECLEN (expr, 0) != 2)
> >>> > +    return false;
> >>> > +
> >>> > +  for (int i = 0; i < XVECLEN (expr, 0); i++)
> >>> > +    {
> >>> > +      rtx op_1;
> >>> > +      rtx out_exp = XVECEXP (expr, 0, i);
> >>> > +
> >>> > +      if (GET_CODE (out_exp) != SET)
> >>> > +       continue;
> >>> > +
> >>> > +      if (!is_load (out_exp, op_1))
> >>> > +       continue;
> >>> > +
> >>> > +      load_count++;
> >>> > +
> >>> > +      for (str_info str : store_exprs)
> >>> > +       {
> >>> > +         rtx store_insn = str.store_insn;
> >>> > +
> >>> > +         if (!is_forwarding (str.store_mem_addr, op_1,
> >>> > +                             GET_MODE (SET_DEST (PATTERN (store_insn)))))
> >>> > +           continue;
> >>> > +
> >>> > +         if (dump_file)
> >>> > +           {
> >>> > +             fprintf (dump_file,
> >>> > +                      "Store forwarding to PARALLEL with loads:\n");
> >>> > +             fprintf (dump_file, "  From: ");
> >>> > +             print_rtl_single (dump_file, store_insn);
> >>> > +             fprintf (dump_file, "  To: ");
> >>> > +             print_rtl_single (dump_file, insn);
> >>> > +           }
> >>> > +
> >>> > +         forwarding = true;
> >>> > +       }
> >>> > +    }
> >>> > +
> >>> > +  if (load_count == 2)
> >>> > +    stats_ldp_count++;
> >>> > +
> >>> > +  return load_count == 2 && forwarding;
> >>> > +}
> >>> > +
> >>> > +/* Break a load pair into its 2 distinct loads, except if the base source
> >>> > +   address to load from is overwritten in the first load.  INSN should be the
> >>> > +   PARALLEL of the load pair.  */
> >>> > +
> >>> > +static void
> >>> > +break_ldp (rtx_insn *insn)
> >>> > +{
> >>> > +  rtx expr = PATTERN (insn);
> >>> > +
> >>> > +  gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN (expr, 0) == 2);
> >>> > +
> >>> > +  rtx load_0 = XVECEXP (expr, 0, 0);
> >>> > +  rtx load_1 = XVECEXP (expr, 0, 1);
> >>> > +
> >>> > +  gcc_checking_assert (is_load (load_0) && is_load (load_1));
> >>> > +
> >>> > +  /* The base address was overwritten in the first load.  */
> >>> > +  if (reg_mentioned_p (SET_DEST (load_0), SET_SRC (load_1)))
> >>> > +    return;
> >>> > +
> >>> > +  emit_insn_before (load_0, insn);
> >>> > +  emit_insn_before (load_1, insn);
> >>> > +  remove_insn (insn);
> >>> > +
> >>> > +  stats_transformed_count++;
> >>> > +}
> >>> > +
> >>> > +static void
> >>> > +scan_and_transform_bb_level ()
> >>> > +{
> >>> > +  rtx_insn *insn, *next;
> >>> > +  basic_block bb;
> >>> > +  FOR_EACH_BB_FN (bb, cfun)
> >>> > +    {
> >>> > +      list_store_info store_exprs;
> >>> > +      unsigned int insn_cnt = 0;
> >>> > +      for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb)); insn = next)
> >>> > +       {
> >>> > +         next = NEXT_INSN (insn);
> >>>
> >>> You probably want NONDEBUG here, otherwise insn_cnt will depend
> >>> on -g?
> >>>
> >>> > +         /* If we cross a CALL_P insn, clear the list, because the
> >>> > +            small-store-to-large-load is unlikely to cause a performance
> >>> > +            difference.  */
> >>> > +         if (CALL_P (insn))
> >>> > +           store_exprs.clear ();
> >>> > +
> >>> > +         if (!NONJUMP_INSN_P (insn))
> >>> > +           continue;
> >>> > +
> >>> > +         cselib_process_insn (insn);
> >>>
> >>> is it necessary to process each insn with cselib?  Only loads & stores I guess?
> >>>
> >>> > +         rtx expr = single_set (insn);
> >>> > +
> >>> > +         /* If a store is encountered, append it to the store_exprs list to
> >>> > +            check it later.  */
> >>> > +         if (expr && is_store (expr))
> >>> > +           {
> >>> > +             rtx store_mem = SET_DEST (expr);
> >>> > +             rtx store_mem_addr = get_addr (XEXP (store_mem, 0));
> >>> > +             machine_mode store_mem_mode = GET_MODE (store_mem);
> >>> > +             store_mem_addr = cselib_lookup (store_mem_addr,
> >>> > +                                             store_mem_mode, 1,
> >>> > +                                             store_mem_mode)->val_rtx;
> >>> > +             store_exprs.push_back ({ insn, store_mem_addr, insn_cnt });
> >>> > +             stats_store_count++;
> >>> > +           }
> >>> > +
> >>> > +         /* Check for small-store-to-large-load.  */
> >>> > +         if (is_small_store_to_large_load (store_exprs, insn))
> >>> > +           {
> >>> > +             stats_ssll_count++;
> >>> > +             break_ldp (insn);
> >>> > +           }
> >>> > +
> >>> > +         /* Pop the first store from the list if its distance crosses the
> >>> > +            maximum accepted threshold.  The list contains unique values
> >>> > +            sorted in ascending order, meaning that only one distance can be
> >>> > +            off at a time.  */
> >>> > +         if (!store_exprs.empty ()
> >>> > +             && (insn_cnt - store_exprs.front ().insn_cnt
> >>> > +                 > (unsigned int) aarch64_store_forwarding_threshold_param))
> >>> > +           store_exprs.pop_front ();
> >>> > +
> >>> > +         insn_cnt++;
> >>> > +       }
> >>> > +    }
> >>> > +}
> >>> > +
> >>> > +static void
> >>> > +execute_avoid_store_forwarding ()
> >>> > +{
> >>> > +  init_alias_analysis ();
> >>> > +  cselib_init (CSELIB_RECORD_MEMORY | CSELIB_PRESERVE_CONSTANTS);
> >>> > +  scan_and_transform_bb_level ();
> >>> > +  end_alias_analysis ();
> >>> > +  cselib_finish ();
> >>> > +  statistics_counter_event (cfun, "Number of stores identified: ",
> >>> > +                            stats_store_count);
> >>> > +  statistics_counter_event (cfun, "Number of load pairs identified: ",
> >>> > +                            stats_ldp_count);
> >>> > +  statistics_counter_event (cfun,
> >>> > +                            "Number of forwarding cases identified: ",
> >>> > +                            stats_ssll_count);
> >>> > +  statistics_counter_event (cfun, "Number of transformed cases: ",
> >>> > +                            stats_transformed_count);
> >>> > +}
> >>> > +
> >>> > +const pass_data pass_data_avoid_store_forwarding =
> >>> > +{
> >>> > +  RTL_PASS, /* type.  */
> >>> > +  "avoid_store_forwarding", /* name.  */
> >>> > +  OPTGROUP_NONE, /* optinfo_flags.  */
> >>> > +  TV_NONE, /* tv_id.  */
> >>> > +  0, /* properties_required.  */
> >>> > +  0, /* properties_provided.  */
> >>> > +  0, /* properties_destroyed.  */
> >>> > +  0, /* todo_flags_start.  */
> >>> > +  0 /* todo_flags_finish.  */
> >>> > +};
> >>> > +
> >>> > +class pass_avoid_store_forwarding : public rtl_opt_pass
> >>> > +{
> >>> > +public:
> >>> > +  pass_avoid_store_forwarding (gcc::context *ctxt)
> >>> > +    : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
> >>> > +  {}
> >>> > +
> >>> > +  /* opt_pass methods: */
> >>> > +  virtual bool gate (function *)
> >>> > +  {
> >>> > +    return aarch64_flag_avoid_store_forwarding;
> >>> > +  }
> >>> > +
> >>> > +  virtual unsigned int execute (function *)
> >>> > +  {
> >>> > +    execute_avoid_store_forwarding ();
> >>> > +    return 0;
> >>> > +  }
> >>> > +
> >>> > +}; // class pass_avoid_store_forwarding
> >>> > +
> >>> > +/* Create a new avoid store forwarding pass instance.  */
> >>> > +
> >>> > +rtl_opt_pass *
> >>> > +make_pass_avoid_store_forwarding (gcc::context *ctxt)
> >>> > +{
> >>> > +  return new pass_avoid_store_forwarding (ctxt);
> >>> > +}
> >>> > diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> >>> > index f5a518202a1..e4498d53b46 100644
> >>> > --- a/gcc/config/aarch64/aarch64.opt
> >>> > +++ b/gcc/config/aarch64/aarch64.opt
> >>> > @@ -304,6 +304,10 @@ moutline-atomics
> >>> >  Target Var(aarch64_flag_outline_atomics) Init(2) Save
> >>> >  Generate local calls to out-of-line atomic operations.
> >>> >
> >>> > +mavoid-store-forwarding
> >>> > +Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0) Optimization
> >>> > +Avoid store forwarding to load pairs.
> >>> > +
> >>> >  -param=aarch64-sve-compare-costs=
> >>> >  Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1) IntegerRange(0, 1) Param
> >>> >  When vectorizing for SVE, consider using unpacked vectors for smaller elements and use the cost model to pick the cheapest approach.  Also use the cost model to choose between SVE and Advanced SIMD vectorization.
> >>> > @@ -360,3 +364,8 @@ Enum(aarch64_ldp_stp_policy) String(never) Value(AARCH64_LDP_STP_POLICY_NEVER)
> >>> >
> >>> >  EnumValue
> >>> >  Enum(aarch64_ldp_stp_policy) String(aligned) Value(AARCH64_LDP_STP_POLICY_ALIGNED)
> >>> > +
> >>> > +-param=aarch64-store-forwarding-threshold=
> >>> > +Target Joined UInteger Var(aarch64_store_forwarding_threshold_param) Init(20) Param
> >>> > +Maximum instruction distance allowed between a store and a load pair for this to be
> >>> > +considered a candidate to avoid when using -mavoid-store-forwarding.
> >>> > diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
> >>> > index 0d96ae3d0b2..5676cdd9585 100644
> >>> > --- a/gcc/config/aarch64/t-aarch64
> >>> > +++ b/gcc/config/aarch64/t-aarch64
> >>> > @@ -194,6 +194,16 @@ aarch64-cc-fusion.o: $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
> >>> >         $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> >>> >                 $(srcdir)/config/aarch64/aarch64-cc-fusion.cc
> >>> >
> >>> > +aarch64-store-forwarding.o: \
> >>> > +    $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \
> >>> > +    $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h $(RTL_BASE_H) \
> >>> > +    dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H) $(RECOG_H) \
> >>> > +    output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
> >>> > +    $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
> >>> > +    $(srcdir)/config/aarch64/aarch64-protos.h
> >>> > +       $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> >>> > +               $(srcdir)/config/aarch64/aarch64-store-forwarding.cc
> >>> > +
> >>> >  comma=,
> >>> >  MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
> >>> >  MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
> >>> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> >>> > index 32f535e1ed4..9bf3a83286a 100644
> >>> > --- a/gcc/doc/invoke.texi
> >>> > +++ b/gcc/doc/invoke.texi
> >>> > @@ -801,7 +801,7 @@ Objective-C and Objective-C++ Dialects}.
> >>> >  -moverride=@var{string}  -mverbose-cost-dump
> >>> >  -mstack-protector-guard=@var{guard} -mstack-protector-guard-reg=@var{sysreg}
> >>> >  -mstack-protector-guard-offset=@var{offset} -mtrack-speculation
> >>> > --moutline-atomics }
> >>> > +-moutline-atomics -mavoid-store-forwarding}
> >>> >
> >>> >  @emph{Adapteva Epiphany Options}
> >>> >  @gccoptlist{-mhalf-reg-file -mprefer-short-insn-regs
> >>> > @@ -16774,6 +16774,11 @@ With @option{--param=aarch64-stp-policy=never}, do not emit stp.
> >>> >  With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the
> >>> >  source pointer is aligned to at least double the alignment of the type.
> >>> >
> >>> > +@item aarch64-store-forwarding-threshold
> >>> > +Maximum allowed instruction distance between a store and a load pair for
> >>> > +this to be considered a candidate to avoid when using
> >>> > +@option{-mavoid-store-forwarding}.
> >>> > +
> >>> >  @item aarch64-loop-vect-issue-rate-niters
> >>> >  The tuning for some AArch64 CPUs tries to take both latencies and issue
> >>> >  rates into account when deciding whether a loop should be vectorized
> >>> > @@ -20857,6 +20862,10 @@ Generate code which uses only the general-purpose registers.  This will prevent
> >>> >  the compiler from using floating-point and Advanced SIMD registers but will not
> >>> >  impose any restrictions on the assembler.
> >>> >
> >>> > +@item -mavoid-store-forwarding
> >>> > +@itemx -mno-avoid-store-forwarding
> >>> > +Avoid store forwarding to load pairs.
> >>> > +
> >>> >  @opindex mlittle-endian
> >>> >  @item -mlittle-endian
> >>> >  Generate little-endian code.  This is the default when GCC is configured for an
> >>> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> >>> > new file mode 100644
> >>> > index 00000000000..b77de6c64b6
> >>> > --- /dev/null
> >>> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> >>> > @@ -0,0 +1,33 @@
> >>> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> >>> > +
> >>> > +#include <stdint.h>
> >>> > +
> >>> > +typedef int v4si __attribute__ ((vector_size (16)));
> >>> > +
> >>> > +/* Different address, same offset, no overlap */
> >>> > +
> >>> > +#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \
> >>> > +TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE *st_arr_2, TYPE i, TYPE dummy){ \
> >>> > +  TYPE r, y; \
> >>> > +  st_arr[0] = i; \
> >>> > +  ld_arr[0] = dummy; \
> >>> > +  r = st_arr_2[0]; \
> >>> > +  y = st_arr_2[1]; \
> >>> > +  return r + y; \
> >>> > +}
> >>> > +
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(int)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(long)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(float)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(double)
> >>> > +LDP_SSLL_NO_OVERLAP_ADDRESS(v4si)
> >>> > +
> >>> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> >>> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> >>> > new file mode 100644
> >>> > index 00000000000..f1b3a66abfd
> >>> > --- /dev/null
> >>> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> >>> > @@ -0,0 +1,33 @@
> >>> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> >>> > +
> >>> > +#include <stdint.h>
> >>> > +
> >>> > +typedef int v4si __attribute__ ((vector_size (16)));
> >>> > +
> >>> > +/* Same address, different offset, no overlap */
> >>> > +
> >>> > +#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \
> >>> > +TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> >>> > +  TYPE r, y; \
> >>> > +  st_arr[0] = i; \
> >>> > +  ld_arr[0] = dummy; \
> >>> > +  r = st_arr[10]; \
> >>> > +  y = st_arr[11]; \
> >>> > +  return r + y; \
> >>> > +}
> >>> > +
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(int32_t)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(int64_t)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(int)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(long)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(float)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(double)
> >>> > +LDP_SSLL_NO_OVERLAP_OFFSET(v4si)
> >>> > +
> >>> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> >>> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> >>> > new file mode 100644
> >>> > index 00000000000..8d5ce5cc87e
> >>> > --- /dev/null
> >>> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> >>> > @@ -0,0 +1,33 @@
> >>> > +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> >>> > +
> >>> > +#include <stdint.h>
> >>> > +
> >>> > +typedef int v4si __attribute__ ((vector_size (16)));
> >>> > +
> >>> > +/* Same address, same offset, overlap */
> >>> > +
> >>> > +#define LDP_SSLL_OVERLAP(TYPE) \
> >>> > +TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> >>> > +  TYPE r, y; \
> >>> > +  st_arr[0] = i; \
> >>> > +  ld_arr[0] = dummy; \
> >>> > +  r = st_arr[0]; \
> >>> > +  y = st_arr[1]; \
> >>> > +  return r + y; \
> >>> > +}
> >>> > +
> >>> > +LDP_SSLL_OVERLAP(uint32_t)
> >>> > +LDP_SSLL_OVERLAP(uint64_t)
> >>> > +LDP_SSLL_OVERLAP(int32_t)
> >>> > +LDP_SSLL_OVERLAP(int64_t)
> >>> > +LDP_SSLL_OVERLAP(int)
> >>> > +LDP_SSLL_OVERLAP(long)
> >>> > +LDP_SSLL_OVERLAP(float)
> >>> > +LDP_SSLL_OVERLAP(double)
> >>> > +LDP_SSLL_OVERLAP(v4si)
> >>> > +
> >>> > +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 } } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 } } */
> >>> > +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> >>> > --
> >>> > 2.41.0
> >>
> >> I will address the other two points on a separate version.
> >>
> >> --
> >> Manos Anagnostakis | Compiler Engineer |
> >> E: manos.anagnosta...@vrull.eu
> >>
> >> VRULL GmbH | Beatrixgasse 32 1030 Vienna |
> >> W: www.vrull.eu | LinkedIn
> >
> >
> > --
> > Manos Anagnostakis | Compiler Engineer |
> > E: manos.anagnosta...@vrull.eu
> >
> > VRULL GmbH | Beatrixgasse 32 1030 Vienna |
> > W: www.vrull.eu | LinkedIn
>