Manos Anagnostakis <[email protected]> writes:
> This is an RTL pass that detects store forwarding from stores to larger loads
> (load pairs).
>
> This optimization is SPEC2017-driven and was found to be beneficial for
> some benchmarks through testing on ampere1/ampere1a machines.
>
> For example, it can transform cases like
>
> str d5, [sp, #320]
> fmul d5, d31, d29
> ldp d31, d17, [sp, #312] # Large load from small store
>
> to
>
> str d5, [sp, #320]
> fmul d5, d31, d29
> ldr d31, [sp, #312]
> ldr d17, [sp, #320]
>
> Currently, the pass is disabled by default on all architectures and enabled
> by a target-specific option.
>
> If deemed beneficial enough, it will be enabled by default on
> ampere1/ampere1a, or on other architectures as well, without needing to be
> turned on by this option.
>
> Bootstrapped and regtested on aarch64-linux.
>
> gcc/ChangeLog:
>
> * config.gcc: Add aarch64-store-forwarding.o to extra_objs.
> * config/aarch64/aarch64-passes.def (INSERT_PASS_AFTER): New pass.
> * config/aarch64/aarch64-protos.h (make_pass_avoid_store_forwarding):
> Declare.
> * config/aarch64/aarch64.opt (mavoid-store-forwarding): New option.
> (aarch64-store-forwarding-threshold): New param.
> * config/aarch64/t-aarch64: Add aarch64-store-forwarding.o.
> * doc/invoke.texi: Document new option and new param.
> * config/aarch64/aarch64-store-forwarding.cc: New file.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/aarch64/ldp_ssll_no_overlap_address.c: New test.
> * gcc.target/aarch64/ldp_ssll_no_overlap_offset.c: New test.
> * gcc.target/aarch64/ldp_ssll_overlap.c: New test.
>
> Signed-off-by: Manos Anagnostakis <[email protected]>
> Co-Authored-By: Manolis Tsamis <[email protected]>
> Co-Authored-By: Philipp Tomsich <[email protected]>
> ---
> Changes in v4:
> - I had problems making cselib_subst_to_values work correctly,
> so I used cselib_lookup to implement the same behaviour and
> record the store value at the time we iterate over it.
> - Removed the store/load_mem_addr check from is_forwarding as
> unnecessary.
> - The pass is called on all optimization levels right now.
> - The threshold check should remain as it is, since we only care about
> the front element of the list. The comment above the check explains
> why a single if is enough.
I still think this is structurally better as a while. There's no reason
in principle why we wouldn't want to record the stores in:
stp x0, x1, [x4, #8]
ldp x0, x1, [x4, #0]
ldp x2, x3, [x4, #16]
and then the two stores should have the same distance value.
I realise we don't do that yet, but still.
> - The documentation changes requested.
> - Adjusted a comment.
>
> gcc/config.gcc | 1 +
> gcc/config/aarch64/aarch64-passes.def | 1 +
> gcc/config/aarch64/aarch64-protos.h | 1 +
> .../aarch64/aarch64-store-forwarding.cc | 321 ++++++++++++++++++
> gcc/config/aarch64/aarch64.opt | 9 +
> gcc/config/aarch64/t-aarch64 | 10 +
> gcc/doc/invoke.texi | 11 +-
> .../aarch64/ldp_ssll_no_overlap_address.c | 33 ++
> .../aarch64/ldp_ssll_no_overlap_offset.c | 33 ++
> .../gcc.target/aarch64/ldp_ssll_overlap.c | 33 ++
> 10 files changed, 452 insertions(+), 1 deletion(-)
> create mode 100644 gcc/config/aarch64/aarch64-store-forwarding.cc
> create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> create mode 100644 gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index 748430194f3..2ee3b61c4fa 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -350,6 +350,7 @@ aarch64*-*-*)
> cxx_target_objs="aarch64-c.o"
> d_target_objs="aarch64-d.o"
> extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o falkor-tag-collision-avoidance.o aarch-bti-insert.o aarch64-cc-fusion.o"
> + extra_objs="${extra_objs} aarch64-store-forwarding.o"
> target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.cc
> \$(srcdir)/config/aarch64/aarch64-sve-builtins.h
> \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
> target_has_targetm_common=yes
> ;;
> diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
> index 6ace797b738..fa79e8adca8 100644
> --- a/gcc/config/aarch64/aarch64-passes.def
> +++ b/gcc/config/aarch64/aarch64-passes.def
> @@ -23,3 +23,4 @@ INSERT_PASS_BEFORE (pass_reorder_blocks, 1, pass_track_speculation);
> INSERT_PASS_AFTER (pass_machine_reorg, 1, pass_tag_collision_avoidance);
> INSERT_PASS_BEFORE (pass_shorten_branches, 1, pass_insert_bti);
> INSERT_PASS_AFTER (pass_if_after_combine, 1, pass_cc_fusion);
> +INSERT_PASS_AFTER (pass_peephole2, 1, pass_avoid_store_forwarding);
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index d2718cc87b3..7d9dfa06af9 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -1050,6 +1050,7 @@ rtl_opt_pass *make_pass_track_speculation (gcc::context *);
> rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *);
> rtl_opt_pass *make_pass_insert_bti (gcc::context *ctxt);
> rtl_opt_pass *make_pass_cc_fusion (gcc::context *ctxt);
> +rtl_opt_pass *make_pass_avoid_store_forwarding (gcc::context *ctxt);
>
> poly_uint64 aarch64_regmode_natural_size (machine_mode);
>
> diff --git a/gcc/config/aarch64/aarch64-store-forwarding.cc b/gcc/config/aarch64/aarch64-store-forwarding.cc
> new file mode 100644
> index 00000000000..ae3cbe519cd
> --- /dev/null
> +++ b/gcc/config/aarch64/aarch64-store-forwarding.cc
> @@ -0,0 +1,321 @@
> +/* Avoid store forwarding optimization pass.
> + Copyright (C) 2023 Free Software Foundation, Inc.
> + Contributed by VRULL GmbH.
> +
> + This file is part of GCC.
> +
> + GCC is free software; you can redistribute it and/or modify it
> + under the terms of the GNU General Public License as published by
> + the Free Software Foundation; either version 3, or (at your option)
> + any later version.
> +
> + GCC is distributed in the hope that it will be useful, but
> + WITHOUT ANY WARRANTY; without even the implied warranty of
> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + General Public License for more details.
> +
> + You should have received a copy of the GNU General Public License
> + along with GCC; see the file COPYING3. If not see
> + <http://www.gnu.org/licenses/>. */
> +
> +#define IN_TARGET_CODE 1
> +
> +#include "config.h"
> +#define INCLUDE_LIST
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "alias.h"
> +#include "rtlanal.h"
> +#include "tree-pass.h"
> +#include "cselib.h"
> +
> +/* This is an RTL pass that detects store forwarding from stores to larger
> + loads (load pairs). For example, it can transform cases like
> +
> + str d5, [sp, #320]
> + fmul d5, d31, d29
> + ldp d31, d17, [sp, #312] # Large load from small store
> +
> + to
> +
> + str d5, [sp, #320]
> + fmul d5, d31, d29
> + ldr d31, [sp, #312]
> + ldr d17, [sp, #320]
> +
> + Design: The pass follows a straightforward design. It starts by
> + initializing the alias analysis and the cselib. Both of these are used to
> + find stores and larger loads with overlapping addresses, which are
> + candidates for store forwarding optimizations. It then scans on basic
> + block level to find stores that forward to larger loads and handles them
> + accordingly as described in the above example. Finally, the alias
> + analysis and the cselib library are closed. */
> +
> +typedef struct
> +{
> + rtx_insn *store_insn;
> + rtx store_mem_addr;
> + unsigned int insn_cnt;
> +} store_info;
> +
> +typedef std::list<store_info> list_store_info;
> +
> +/* Statistics counters. */
> +static unsigned int stats_store_count = 0;
> +static unsigned int stats_ldp_count = 0;
> +static unsigned int stats_ssll_count = 0;
> +static unsigned int stats_transformed_count = 0;
> +
> +/* Placeholder used as the default OP_1 argument of is_load. */
> +static rtx dummy;
> +static bool is_load (rtx expr, rtx &op_1=dummy);
> +
> +/* Return true if SET expression EXPR is a store; otherwise false. */
> +
> +static bool
> +is_store (rtx expr)
> +{
> + return MEM_P (SET_DEST (expr));
> +}
> +
> +/* Return true if SET expression EXPR is a load; otherwise false. OP_1 will
> + contain the MEM operand of the load. */
> +
> +static bool
> +is_load (rtx expr, rtx &op_1)
> +{
> + op_1 = SET_SRC (expr);
> +
> + if (GET_CODE (op_1) == ZERO_EXTEND
> + || GET_CODE (op_1) == SIGN_EXTEND)
> + op_1 = XEXP (op_1, 0);
> +
> + return MEM_P (op_1);
> +}
> +
> +/* Return true if STORE_MEM_ADDR is forwarding to the address of LOAD_MEM;
> + otherwise false. STORE_MEM_MODE is the mode of the MEM rtx containing
> + STORE_MEM_ADDR. */
> +
> +static bool
> +is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode store_mem_mode)
> +{
> + /* Sometimes we do not have the proper value. */
> + if (!CSELIB_VAL_PTR (store_mem_addr))
> + return false;
> +
> + gcc_checking_assert (MEM_P (load_mem));
> +
> + rtx load_mem_addr = get_addr (XEXP (load_mem, 0));
> + machine_mode load_mem_mode = GET_MODE (load_mem);
> + load_mem_addr = cselib_lookup (load_mem_addr, load_mem_mode, 1,
> + load_mem_mode)->val_rtx;
Like I said in the previous review, it shouldn't be necessary to do any
manual lookup on the load address. rtx_equal_for_cselib_1 does the
lookup itself. Does that not work?
The patch is OK with the four lines above deleted, if that works,
and with s/if/while/. But please reply if that combination doesn't work.
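For reference, an untested sketch of what is_forwarding would look like with
those four lines deleted (I'm assuming rtx_equal_for_cselib_1 can then take
the bare load address; the exact form may differ):

```cpp
static bool
is_forwarding (rtx store_mem_addr, rtx load_mem, machine_mode store_mem_mode)
{
  /* Sometimes we do not have the proper value.  */
  if (!CSELIB_VAL_PTR (store_mem_addr))
    return false;

  gcc_checking_assert (MEM_P (load_mem));

  return rtx_equal_for_cselib_1 (store_mem_addr,
				 XEXP (load_mem, 0),
				 store_mem_mode, 0);
}
```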
Thanks,
Richard
> + return rtx_equal_for_cselib_1 (store_mem_addr,
> + load_mem_addr,
> + store_mem_mode, 0);
> +}
> +
> +/* Return true if INSN is a load pair, preceded by a store forwarding to it;
> + otherwise false. STORE_EXPRS contains the stores. */
> +
> +static bool
> +is_small_store_to_large_load (list_store_info store_exprs, rtx_insn *insn)
> +{
> + unsigned int load_count = 0;
> + bool forwarding = false;
> + rtx expr = PATTERN (insn);
> +
> + if (GET_CODE (expr) != PARALLEL
> + || XVECLEN (expr, 0) != 2)
> + return false;
> +
> + for (int i = 0; i < XVECLEN (expr, 0); i++)
> + {
> + rtx op_1;
> + rtx out_exp = XVECEXP (expr, 0, i);
> +
> + if (GET_CODE (out_exp) != SET)
> + continue;
> +
> + if (!is_load (out_exp, op_1))
> + continue;
> +
> + load_count++;
> +
> + for (store_info str : store_exprs)
> + {
> + rtx store_insn = str.store_insn;
> +
> + if (!is_forwarding (str.store_mem_addr, op_1,
> + GET_MODE (SET_DEST (PATTERN (store_insn)))))
> + continue;
> +
> + if (dump_file)
> + {
> + fprintf (dump_file,
> + "Store forwarding to PARALLEL with loads:\n");
> + fprintf (dump_file, " From: ");
> + print_rtl_single (dump_file, store_insn);
> + fprintf (dump_file, " To: ");
> + print_rtl_single (dump_file, insn);
> + }
> +
> + forwarding = true;
> + }
> + }
> +
> + if (load_count == 2)
> + stats_ldp_count++;
> +
> + return load_count == 2 && forwarding;
> +}
> +
> +/* Break a load pair into its two distinct loads, except if the base source
> + address to load from is overwritten in the first load. INSN should be the
> + PARALLEL of the load pair. */
> +
> +static void
> +break_ldp (rtx_insn *insn)
> +{
> + rtx expr = PATTERN (insn);
> +
> + gcc_checking_assert (GET_CODE (expr) == PARALLEL && XVECLEN (expr, 0) == 2);
> +
> + rtx load_0 = XVECEXP (expr, 0, 0);
> + rtx load_1 = XVECEXP (expr, 0, 1);
> +
> + gcc_checking_assert (is_load (load_0) && is_load (load_1));
> +
> + /* The base address was overwritten in the first load. */
> + if (reg_mentioned_p (SET_DEST (load_0), SET_SRC (load_1)))
> + return;
> +
> + emit_insn_before (load_0, insn);
> + emit_insn_before (load_1, insn);
> + remove_insn (insn);
> +
> + stats_transformed_count++;
> +}
> +
> +static void
> +scan_and_transform_bb_level ()
> +{
> + rtx_insn *insn, *next;
> + basic_block bb;
> + FOR_EACH_BB_FN (bb, cfun)
> + {
> + list_store_info store_exprs;
> + unsigned int insn_cnt = 0;
> + for (insn = BB_HEAD (bb); insn != NEXT_INSN (BB_END (bb)); insn = next)
> + {
> + next = NEXT_INSN (insn);
> +
> + /* If we cross a CALL_P insn, clear the list, because the
> + small-store-to-large-load is unlikely to cause a performance
> + difference. */
> + if (CALL_P (insn))
> + store_exprs.clear ();
> +
> + if (!NONJUMP_INSN_P (insn))
> + continue;
> +
> + cselib_process_insn (insn);
> +
> + rtx expr = single_set (insn);
> +
> + /* If a store is encountered, append it to the store_exprs list to
> + check it later. */
> + if (expr && is_store (expr))
> + {
> + rtx store_mem = SET_DEST (expr);
> + rtx store_mem_addr = get_addr (XEXP (store_mem, 0));
> + machine_mode store_mem_mode = GET_MODE (store_mem);
> + store_mem_addr = cselib_lookup (store_mem_addr,
> + store_mem_mode, 1,
> + store_mem_mode)->val_rtx;
> + store_exprs.push_back ({ insn, store_mem_addr, insn_cnt++ });
> + stats_store_count++;
> + }
> +
> + /* Check for small-store-to-large-load. */
> + if (is_small_store_to_large_load (store_exprs, insn))
> + {
> + stats_ssll_count++;
> + break_ldp (insn);
> + }
> +
> + /* Pop the first store from the list if its distance exceeds the
> + maximum accepted threshold. The list contains unique values
> + sorted in ascending order, meaning that only one distance can be
> + off at a time. */
> + if (!store_exprs.empty ()
> + && (insn_cnt - store_exprs.front ().insn_cnt
> + > (unsigned int) aarch64_store_forwarding_threshold_param))
> + store_exprs.pop_front ();
> + }
> + }
> +}
> +
> +static void
> +execute_avoid_store_forwarding ()
> +{
> + init_alias_analysis ();
> + cselib_init (CSELIB_RECORD_MEMORY | CSELIB_PRESERVE_CONSTANTS);
> + scan_and_transform_bb_level ();
> + end_alias_analysis ();
> + cselib_finish ();
> + statistics_counter_event (cfun, "Number of stores identified: ",
> + stats_store_count);
> + statistics_counter_event (cfun, "Number of load pairs identified: ",
> + stats_ldp_count);
> + statistics_counter_event (cfun,
> + "Number of forwarding cases identified: ",
> + stats_ssll_count);
> + statistics_counter_event (cfun, "Number of transformed cases: ",
> + stats_transformed_count);
> +}
> +
> +const pass_data pass_data_avoid_store_forwarding =
> +{
> + RTL_PASS, /* type. */
> + "avoid_store_forwarding", /* name. */
> + OPTGROUP_NONE, /* optinfo_flags. */
> + TV_NONE, /* tv_id. */
> + 0, /* properties_required. */
> + 0, /* properties_provided. */
> + 0, /* properties_destroyed. */
> + 0, /* todo_flags_start. */
> + 0 /* todo_flags_finish. */
> +};
> +
> +class pass_avoid_store_forwarding : public rtl_opt_pass
> +{
> +public:
> + pass_avoid_store_forwarding (gcc::context *ctxt)
> + : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
> + {}
> +
> + /* opt_pass methods: */
> + virtual bool gate (function *)
> + {
> + return aarch64_flag_avoid_store_forwarding;
> + }
> +
> + virtual unsigned int execute (function *)
> + {
> + execute_avoid_store_forwarding ();
> + return 0;
> + }
> +
> +}; // class pass_avoid_store_forwarding
> +
> +/* Create a new avoid store forwarding pass instance. */
> +
> +rtl_opt_pass *
> +make_pass_avoid_store_forwarding (gcc::context *ctxt)
> +{
> + return new pass_avoid_store_forwarding (ctxt);
> +}
> diff --git a/gcc/config/aarch64/aarch64.opt b/gcc/config/aarch64/aarch64.opt
> index f5a518202a1..e4498d53b46 100644
> --- a/gcc/config/aarch64/aarch64.opt
> +++ b/gcc/config/aarch64/aarch64.opt
> @@ -304,6 +304,10 @@ moutline-atomics
> Target Var(aarch64_flag_outline_atomics) Init(2) Save
> Generate local calls to out-of-line atomic operations.
>
> +mavoid-store-forwarding
> +Target Bool Var(aarch64_flag_avoid_store_forwarding) Init(0) Optimization
> +Avoid store forwarding to load pairs.
> +
> -param=aarch64-sve-compare-costs=
> Target Joined UInteger Var(aarch64_sve_compare_costs) Init(1) IntegerRange(0, 1) Param
> When vectorizing for SVE, consider using unpacked vectors for smaller
> elements and use the cost model to pick the cheapest approach. Also use the
> cost model to choose between SVE and Advanced SIMD vectorization.
> @@ -360,3 +364,8 @@ Enum(aarch64_ldp_stp_policy) String(never) Value(AARCH64_LDP_STP_POLICY_NEVER)
>
> EnumValue
> Enum(aarch64_ldp_stp_policy) String(aligned) Value(AARCH64_LDP_STP_POLICY_ALIGNED)
> +
> +-param=aarch64-store-forwarding-threshold=
> +Target Joined UInteger Var(aarch64_store_forwarding_threshold_param) Init(20) Param
> +Maximum instruction distance allowed between a store and a load pair for
> +this to be considered a candidate to avoid when using
> +-mavoid-store-forwarding.
> diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
> index a9a244ab6d6..7639b50358d 100644
> --- a/gcc/config/aarch64/t-aarch64
> +++ b/gcc/config/aarch64/t-aarch64
> @@ -176,6 +176,16 @@ aarch64-cc-fusion.o: $(srcdir)/config/aarch64/aarch64-cc-fusion.cc \
> $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> $(srcdir)/config/aarch64/aarch64-cc-fusion.cc
>
> +aarch64-store-forwarding.o: \
> + $(srcdir)/config/aarch64/aarch64-store-forwarding.cc \
> + $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h $(RTL_BASE_H) \
> + dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H) $(RECOG_H) \
> + output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
> + $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
> + $(srcdir)/config/aarch64/aarch64-protos.h
> + $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> + $(srcdir)/config/aarch64/aarch64-store-forwarding.cc
> +
> comma=,
> MULTILIB_OPTIONS = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
> MULTILIB_DIRNAMES = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 2b51ff304f6..39dbc04207e 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -798,7 +798,7 @@ Objective-C and Objective-C++ Dialects}.
> -moverride=@var{string} -mverbose-cost-dump
> -mstack-protector-guard=@var{guard} -mstack-protector-guard-reg=@var{sysreg}
> -mstack-protector-guard-offset=@var{offset} -mtrack-speculation
> --moutline-atomics }
> +-moutline-atomics -mavoid-store-forwarding}
>
> @emph{Adapteva Epiphany Options}
> @gccoptlist{-mhalf-reg-file -mprefer-short-insn-regs
> @@ -16738,6 +16738,11 @@ With @option{--param=aarch64-stp-policy=never}, do
> not emit stp.
> With @option{--param=aarch64-stp-policy=aligned}, emit stp only if the
> source pointer is aligned to at least double the alignment of the type.
>
> +@item aarch64-store-forwarding-threshold
> +Maximum allowed instruction distance between a store and a load pair for
> +this to be considered a candidate to avoid when using
> +@option{-mavoid-store-forwarding}.
> +
> @item aarch64-loop-vect-issue-rate-niters
> The tuning for some AArch64 CPUs tries to take both latencies and issue
> rates into account when deciding whether a loop should be vectorized
> @@ -20763,6 +20768,10 @@ Generate code which uses only the general-purpose
> registers. This will prevent
> the compiler from using floating-point and Advanced SIMD registers but will
> not impose any restrictions on the assembler.
>
> +@item -mavoid-store-forwarding
> +@itemx -mno-avoid-store-forwarding
> +Avoid store forwarding to load pairs.
> +
> @opindex mlittle-endian
> @item -mlittle-endian
> Generate little-endian code. This is the default when GCC is configured for
> an
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> new file mode 100644
> index 00000000000..b77de6c64b6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_address.c
> @@ -0,0 +1,33 @@
> +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> +
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +/* Different address, same offset, no overlap */
> +
> +#define LDP_SSLL_NO_OVERLAP_ADDRESS(TYPE) \
> +TYPE ldp_ssll_no_overlap_address_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE *st_arr_2, TYPE i, TYPE dummy){ \
> + TYPE r, y; \
> + st_arr[0] = i; \
> + ld_arr[0] = dummy; \
> + r = st_arr_2[0]; \
> + y = st_arr_2[1]; \
> + return r + y; \
> +}
> +
> +LDP_SSLL_NO_OVERLAP_ADDRESS(uint32_t)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(uint64_t)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(int32_t)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(int64_t)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(int)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(long)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(float)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(double)
> +LDP_SSLL_NO_OVERLAP_ADDRESS(v4si)
> +
> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> new file mode 100644
> index 00000000000..f1b3a66abfd
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_no_overlap_offset.c
> @@ -0,0 +1,33 @@
> +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> +
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +/* Same address, different offset, no overlap */
> +
> +#define LDP_SSLL_NO_OVERLAP_OFFSET(TYPE) \
> +TYPE ldp_ssll_no_overlap_offset_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> + TYPE r, y; \
> + st_arr[0] = i; \
> + ld_arr[0] = dummy; \
> + r = st_arr[10]; \
> + y = st_arr[11]; \
> + return r + y; \
> +}
> +
> +LDP_SSLL_NO_OVERLAP_OFFSET(uint32_t)
> +LDP_SSLL_NO_OVERLAP_OFFSET(uint64_t)
> +LDP_SSLL_NO_OVERLAP_OFFSET(int32_t)
> +LDP_SSLL_NO_OVERLAP_OFFSET(int64_t)
> +LDP_SSLL_NO_OVERLAP_OFFSET(int)
> +LDP_SSLL_NO_OVERLAP_OFFSET(long)
> +LDP_SSLL_NO_OVERLAP_OFFSET(float)
> +LDP_SSLL_NO_OVERLAP_OFFSET(double)
> +LDP_SSLL_NO_OVERLAP_OFFSET(v4si)
> +
> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 3 } } */
> +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 1 } } */
> +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 1 } } */
> +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> new file mode 100644
> index 00000000000..8d5ce5cc87e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/ldp_ssll_overlap.c
> @@ -0,0 +1,33 @@
> +/* { dg-options "-O2 -mcpu=generic -mavoid-store-forwarding" } */
> +
> +#include <stdint.h>
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +/* Same address, same offset, overlap */
> +
> +#define LDP_SSLL_OVERLAP(TYPE) \
> +TYPE ldp_ssll_overlap_##TYPE(TYPE *ld_arr, TYPE *st_arr, TYPE i, TYPE dummy){ \
> + TYPE r, y; \
> + st_arr[0] = i; \
> + ld_arr[0] = dummy; \
> + r = st_arr[0]; \
> + y = st_arr[1]; \
> + return r + y; \
> +}
> +
> +LDP_SSLL_OVERLAP(uint32_t)
> +LDP_SSLL_OVERLAP(uint64_t)
> +LDP_SSLL_OVERLAP(int32_t)
> +LDP_SSLL_OVERLAP(int64_t)
> +LDP_SSLL_OVERLAP(int)
> +LDP_SSLL_OVERLAP(long)
> +LDP_SSLL_OVERLAP(float)
> +LDP_SSLL_OVERLAP(double)
> +LDP_SSLL_OVERLAP(v4si)
> +
> +/* { dg-final { scan-assembler-times "ldp\tw\[0-9\]+, w\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\tx\[0-9\]+, x\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\ts\[0-9\]+, s\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\td\[0-9\]+, d\[0-9\]" 0 } } */
> +/* { dg-final { scan-assembler-times "ldp\tq\[0-9\]+, q\[0-9\]" 0 } } */
> --
> 2.41.0