Hi,

On 14 February 2018 at 09:47, Kugan Vivekanandarajah
<kugan.vivekanandara...@linaro.org> wrote:
> Hi Kyrill,
>
> On 13 February 2018 at 20:47, Kyrill Tkachov
> <kyrylo.tkac...@foss.arm.com> wrote:
>> Hi Kugan,
>>
>> On 12/02/18 23:58, Kugan Vivekanandarajah wrote:
>>>
>>> This implements a machine reorg pass for aarch64/Falkor to handle
>>> prefetcher tag collisions.  It is strictly not part of the loop
>>> unroller, but for Falkor, unrolling can make the h/w prefetcher
>>> perform badly if there are too many tag collisions, based on the
>>> discussions in https://gcc.gnu.org/ml/gcc/2017-10/msg00178.html.
>>>
>>
>> Could you expand a bit more on what transformation exactly this pass
>> does?
>
> This is similar to what LLVM does in https://reviews.llvm.org/D35366.
>
> The Falkor hardware prefetcher works well when the signatures of the
> prefetches (or tags, as computed in the patch - similar to LLVM) are
> different for different memory streams.  If different memory streams
> have the same signature, it can result in bad performance.  This
> machine reorg pass tries to change the signature of memory loads by
> changing the base register with a free register.
>
>> From my understanding, loads that use the same base register and
>> offset and have the same destination register are considered part of
>> the same stream by the hardware prefetcher, so for example:
>>   ldr x0, [x1, 16]   (load1)
>>   ... (set x1 to something else)
>>   ldr x0, [x1, 16]   (load2)
>>
>> will cause the prefetcher to think that both loads are part of the
>> same stream, so this pass tries to rewrite the sequence into:
>>   ldr x0, [x1, 16]
>>   ... (set x1 to something else)
>>   mov tmp, x1
>>   ldr x0, [tmp, 16]
>>
>> where the tag/signature is the combination of destination x0, base x1
>> and offset 16.  Is this a fair description?
>
> This is precisely what is happening.
>
>> I've got some comments on the patch itself.
>>
>>> gcc/ChangeLog:
>>>
>>> 2018-02-12  Kugan Vivekanandarajah  <kug...@linaro.org>
>>>
>>>     * config/aarch64/aarch64.c (iv_p): New.
>>>     (strided_load_p): Likewise.
>>>     (make_tag): Likewise.
>>>     (get_load_info): Likewise.
>>>     (aarch64_reorg): Likewise.
>>>     (TARGET_MACHINE_DEPENDENT_REORG): Implement new target hook.
>>
>> New functions need function comments describing the arguments at
>> least.  Functions like make_tag, get_load_info etc. can get tricky to
>> maintain without some documentation on what they are supposed to
>> accept and return.
>
> I will add the comments.
>
>> I think the pass should be enabled at certain optimisation levels,
>> say -O2?  I don't think it would be desirable at -Os since it creates
>> extra moves that increase code size.
>
> Ok, I will change this.
>
>> That being said, I would recommend you implement this as an
>> aarch64-specific pass, in a similar way to cortex-a57-fma-steering.c.
>> That way you can register it in aarch64-passes.def and have
>> flexibility as to when exactly the pass gets to run (i.e. you
>> wouldn't be limited by when machine_reorg gets run).
>>
>> Also, I suggest you don't use the "if (aarch64_tune != falkor)
>> return;" way of gating this pass.  Do it in a similar way to the FMA
>> steering pass, that is, define a new flag in aarch64-tuning-flags.def
>> and use it in the tune_flags field of the falkor tuning struct.
>
> Ok, I will revise the patch.
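To make the signature concrete: the tag is a small integer packed from
the destination register number, the base register number and a scaled
offset.  The sketch below mirrors make_tag () in the patch (the field
widths follow the LLVM implementation rather than any architecturally
documented hash, so treat it as an illustration only; tag_of is a
hypothetical name):

    /* Illustration only: pack the load's destination regno, base regno
       and scaled offset into a prefetcher tag, as make_tag () in the
       patch below does.  */
    static unsigned
    tag_of (unsigned dest, unsigned base, unsigned offset)
    {
      return (dest & 0xf)               /* Bits 0-3: destination regno.  */
             | ((base & 0xf) << 4)      /* Bits 4-7: base regno.  */
             | ((offset & 0x3f) << 8);  /* Bits 8-13: scaled offset.  */
    }

With this packing, the two loads in the example above collide (same x0,
same x1, same offset 16), while copying x1 into a scratch register
before the second load changes bits 4-7 and separates the streams.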
Here is the revised patch.

Thanks,
Kugan

gcc/ChangeLog:

2018-02-15  Kugan Vivekanandarajah  <kug...@linaro.org>

	* config.gcc: Add falkor-tag-collision-avoidance.o to extra_objs
	for aarch64-*-*.
	* config/aarch64/aarch64-protos.h (make_pass_tag_collision_avoidance):
	Declare.
	* config/aarch64/aarch64-passes.def: Insert tag collision avoidance
	pass.
	* config/aarch64/aarch64-tuning-flags.def
	(AARCH64_EXTRA_TUNE_AVOID_PREFETCH_TAG_COLLISION): Define.
	* config/aarch64/aarch64.c (qdf24xx_tunings): Add
	AARCH64_EXTRA_TUNE_AVOID_PREFETCH_TAG_COLLISION.
	* config/aarch64/falkor-tag-collision-avoidance.c: New file.
	* config/aarch64/t-aarch64: Add falkor-tag-collision-avoidance.o.

>
> Thanks,
> Kugan
>
>> Hope this helps,
>> Kyrill
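For reference, a deliberately simplified (hypothetical) example of the
kind of code the pass targets: after unrolling, the loads from a[i] and
a[i + 1] below can end up with the same destination register, the same
base register, and offsets (0 and 8) that scale down to the same tag
field under the offset scaling in the patch:

    /* Hypothetical example: once the loop is unrolled, consecutive
       loads from A share a base register, and their byte offsets 0 and
       8 both scale to the same tag field, so the loads can collide.  */
    long
    sum (long *a, int n)
    {
      long s = 0;
      for (int i = 0; i < n; i++)
        s += a[i];
      return s;
    }

When the pass finds such a collision, it copies the base register of
one of the colliding loads into a free scratch register (r9-r15 in this
patch), which changes that load's tag and lets the prefetcher train on
the two streams separately.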
diff --git a/gcc/config.gcc b/gcc/config.gcc
index eca156a..c3f3e1a 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -303,7 +303,7 @@ aarch64*-*-*)
 	extra_headers="arm_fp16.h arm_neon.h arm_acle.h"
 	c_target_objs="aarch64-c.o"
 	cxx_target_objs="aarch64-c.o"
-	extra_objs="aarch64-builtins.o aarch-common.o cortex-a57-fma-steering.o"
+	extra_objs="aarch64-builtins.o aarch-common.o cortex-a57-fma-steering.o falkor-tag-collision-avoidance.o"
 	target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.c"
 	target_has_targetm_common=yes
 	;;
diff --git a/gcc/config/aarch64/aarch64-passes.def b/gcc/config/aarch64/aarch64-passes.def
index 87747b4..d4b6a43 100644
--- a/gcc/config/aarch64/aarch64-passes.def
+++ b/gcc/config/aarch64/aarch64-passes.def
@@ -19,3 +19,4 @@
    <http://www.gnu.org/licenses/>.  */
 
 INSERT_PASS_AFTER (pass_regrename, 1, pass_fma_steering);
+INSERT_PASS_AFTER (pass_fast_rtl_dce, 1, pass_tag_collision_avoidance);
diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 2d705d2..d8f6964 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -544,6 +544,7 @@ std::string aarch64_get_extension_string_for_isa_flags (unsigned long,
 							unsigned long);
 
 rtl_opt_pass *make_pass_fma_steering (gcc::context *ctxt);
+rtl_opt_pass *make_pass_tag_collision_avoidance (gcc::context *ctxt);
 
 poly_uint64 aarch64_regmode_natural_size (machine_mode);
 
diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def b/gcc/config/aarch64/aarch64-tuning-flags.def
index ea9ead2..c0dd178 100644
--- a/gcc/config/aarch64/aarch64-tuning-flags.def
+++ b/gcc/config/aarch64/aarch64-tuning-flags.def
@@ -41,4 +41,6 @@ AARCH64_EXTRA_TUNING_OPTION ("slow_unaligned_ldpw", SLOW_UNALIGNED_LDPW)
    are not considered cheap.  */
 AARCH64_EXTRA_TUNING_OPTION ("cheap_shift_extend", CHEAP_SHIFT_EXTEND)
 
+AARCH64_EXTRA_TUNING_OPTION ("avoid_prefetch_tag_collision", AVOID_PREFETCH_TAG_COLLISION)
+
 #undef AARCH64_EXTRA_TUNING_OPTION
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 2e70f3a..b075325 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -875,7 +875,7 @@ static const struct tune_params qdf24xx_tunings =
   2,	/* min_div_recip_mul_df.  */
   0,	/* max_case_values.  */
   tune_params::AUTOPREFETCHER_WEAK,	/* autoprefetcher_model.  */
-  (AARCH64_EXTRA_TUNE_NONE),		/* tune_flags.  */
+  (AARCH64_EXTRA_TUNE_AVOID_PREFETCH_TAG_COLLISION),	/* tune_flags.  */
   &qdf24xx_prefetch_tune
 };
 
diff --git a/gcc/config/aarch64/falkor-tag-collision-avoidance.c b/gcc/config/aarch64/falkor-tag-collision-avoidance.c
index e69de29..1fe320f 100644
--- a/gcc/config/aarch64/falkor-tag-collision-avoidance.c
+++ b/gcc/config/aarch64/falkor-tag-collision-avoidance.c
@@ -0,0 +1,468 @@
+/* Tag Collision Avoidance pass for Falkor.
+   Copyright (C) 2018 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#define IN_TARGET_CODE 1
+
+#include "config.h"
+#define INCLUDE_LIST
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "target.h"
+#include "rtl.h"
+#include "tree.h"
+#include "tree-pass.h"
+#include "aarch64-protos.h"
+#include "hash-map.h"
+#include "cfgloop.h"
+#include "cfgrtl.h"
+#include "rtl-iter.h"
+#include "df.h"
+#include "memmodel.h"
+#include "optabs.h"
+#include "regs.h"
+#include "recog.h"
+
+/* The Falkor hardware prefetcher works well when the signatures of the
+   prefetches (or tags, as computed in this pass) are different for
+   different memory streams.  If different memory streams have the same
+   signature, it can result in bad performance.  This pass tries to
+   change the signature of memory loads by changing the base register
+   with a free register.
+
+   The signature (TAG) is based on the SRC, DST and offset of the load.
+   If the signature is the same, the loads will be considered part of
+   the same stream by the hardware prefetcher, for example:
+     ldr x0, [x1, 16]   (load stream 1)
+     ... (x1 is reused for a different stream)
+     ldr x0, [x1, 16]   (load stream 2)
+
+   will cause the prefetcher to think that both loads are part of the
+   same stream, so this pass tries to rewrite the sequence into:
+     ldr x0, [x1, 16]
+     mov tmp, x1
+     ldr x0, [tmp, 16]
+   such that the signatures are different.  */
+
+/* Return true if REG is an induction variable (IV) in LOOP, false
+   otherwise.  This is an approximate check and does not rely on the
+   functionality provided by the likes of biv () as the loop form
+   might not be suitable for such analysis.  */
+
+static bool
+iv_p (rtx reg, struct loop *loop)
+{
+  df_ref adef;
+  unsigned regno = REGNO (reg);
+  bool def_in_loop = false;
+  bool def_out_loop = false;
+
+  if (GET_MODE_CLASS (GET_MODE (reg)) != MODE_INT)
+    return false;
+
+  for (adef = DF_REG_DEF_CHAIN (regno); adef; adef = DF_REF_NEXT_REG (adef))
+    {
+      if (!DF_REF_INSN_INFO (adef)
+          || !NONDEBUG_INSN_P (DF_REF_INSN (adef)))
+        continue;
+
+      basic_block bb = DF_REF_BB (adef);
+      if (dominated_by_p (CDI_DOMINATORS, bb, loop->header)
+          && bb->loop_father == loop)
+        {
+          rtx_insn *insn = DF_REF_INSN (adef);
+          recog_memoized (insn);
+          rtx pat = PATTERN (insn);
+          if (GET_CODE (pat) != SET)
+            continue;
+          rtx x = SET_SRC (pat);
+          if (GET_CODE (x) == ZERO_EXTRACT
+              || GET_CODE (x) == ZERO_EXTEND
+              || GET_CODE (x) == SIGN_EXTEND)
+            x = XEXP (x, 0);
+          if (MEM_P (x))
+            continue;
+          if (GET_CODE (x) == POST_INC
+              || GET_CODE (x) == POST_DEC
+              || GET_CODE (x) == PRE_INC
+              || GET_CODE (x) == PRE_DEC)
+            def_in_loop = true;
+          else if (BINARY_P (x))
+            def_in_loop = true;
+        }
+      if (dominated_by_p (CDI_DOMINATORS, loop->header, bb))
+        def_out_loop = true;
+      if (def_in_loop && def_out_loop)
+        return true;
+    }
+  return false;
+}
+
+/* Return true if X is a strided load in LOOP, false otherwise.
+   If it is a strided load, set BASE and OFFSET.  Also, if this is
+   a pre/post increment load, set PRE_POST to true.  */
+
+static bool
+strided_load_p (rtx x,
+                struct loop *loop,
+                bool *pre_post,
+                rtx *base,
+                rtx *offset)
+{
+  /* If the loaded value is extended, get the src.  */
+  if (GET_CODE (x) == ZERO_EXTRACT
+      || GET_CODE (x) == ZERO_EXTEND
+      || GET_CODE (x) == SIGN_EXTEND)
+    x = XEXP (x, 0);
+
+  /* If it is not a MEM, it is not loaded from memory.  */
+  if (!MEM_P (x))
+    return false;
+
+  /* Get the address of the MEM.  */
+  x = XEXP (x, 0);
+
+  /* If it is a post/pre increment, get the src.  */
+  if (GET_CODE (x) == POST_INC
+      || GET_CODE (x) == POST_DEC
+      || GET_CODE (x) == PRE_INC
+      || GET_CODE (x) == PRE_DEC)
+    {
+      x = XEXP (x, 0);
+      *pre_post = true;
+    }
+
+  /* Get the base and offset depending on the type.  */
+  if (REG_P (x)
+      || UNARY_P (x))
+    {
+      if (!REG_P (x))
+        x = XEXP (x, 0);
+      if (REG_P (x)
+          && iv_p (x, loop))
+        {
+          *base = x;
+          return true;
+        }
+    }
+  else if (BINARY_P (x))
+    {
+      rtx reg1, reg2;
+      reg1 = XEXP (x, 0);
+
+      if (REG_P (reg1)
+          && REGNO (reg1) == SP_REGNUM)
+        return false;
+      reg2 = XEXP (x, 1);
+
+      if (REG_P (reg1)
+          && iv_p (reg1, loop))
+        {
+          *base = reg1;
+          *offset = reg2;
+          return true;
+        }
+
+      if (REG_P (reg1)
+          && REG_P (reg2)
+          && iv_p (reg2, loop))
+        {
+          *base = reg1;
+          *offset = reg2;
+          return true;
+        }
+    }
+  return false;
+}
+
+/* Compute the TAG (or signature) based on the BASE, DEST and
+   OFFSET of the load.  */
+
+static unsigned
+make_tag (unsigned dest, unsigned base, unsigned offset)
+{
+  return (dest & 0xf)
+         | ((base & 0xf) << 4)
+         | ((offset & 0x3f) << 8);
+}
+
+/* Return true if INSN is a strided load in LOOP.
+   If it is a strided load, set DEST, BASE and OFFSET.
+   Also, if this is a pre/post increment load, set PRE_POST
+   to true.  */
+
+static bool
+get_load_info (rtx_insn *insn,
+               struct loop *loop,
+               bool *pre_post,
+               rtx *base,
+               rtx *dest,
+               rtx *offset)
+{
+  subrtx_var_iterator::array_type array;
+  if (!INSN_P (insn) || recog_memoized (insn) < 0)
+    return false;
+  rtx pat = PATTERN (insn);
+  switch (GET_CODE (pat))
+    {
+    case PARALLEL:
+      {
+        for (int j = 0; j < XVECLEN (pat, 0); ++j)
+          {
+            rtx ex = XVECEXP (pat, 0, j);
+            FOR_EACH_SUBRTX_VAR (iter, array, ex, NONCONST)
+              {
+                const_rtx x = *iter;
+                if (GET_CODE (x) == SET
+                    && strided_load_p (SET_SRC (x), loop, pre_post,
+                                       base, offset))
+                  {
+                    *dest = SET_DEST (x);
+                    return true;
+                  }
+              }
+          }
+      }
+      break;
+
+    case SET:
+      FOR_EACH_SUBRTX_VAR (iter, array, SET_SRC (pat), NONCONST)
+        {
+          rtx x = *iter;
+          if (strided_load_p (x, loop, pre_post,
+                              base, offset))
+            {
+              *dest = SET_DEST (pat);
+              return true;
+            }
+        }
+
+    default:
+      break;
+    }
+  return false;
+}
+
+/* Tag collision avoidance pass for Falkor.  */
+
+void
+execute_tag_collision_avoidance ()
+{
+  basic_block *body, bb;
+  struct loop *loop;
+  rtx_insn *insn;
+
+  compute_bb_for_insn ();
+  /* Compute live regs.  */
+  df_compute_regs_ever_live (true);
+  df_analyze ();
+
+  /* Find the loops.  */
+  loop_optimizer_init (AVOID_CFG_MODIFICATIONS);
+  calculate_dominance_info (CDI_DOMINATORS);
+  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
+    {
+      hash_map <rtx, auto_vec<rtx_insn *> > tag_map (512);
+      body = get_loop_body (loop);
+      auto_vec <rtx> tags;
+
+      /* Record all the memory tags.  */
+      for (unsigned i = 0; i < loop->num_nodes; i++)
+        {
+          bb = body[i];
+          FOR_BB_INSNS (bb, insn)
+            {
+              unsigned tag;
+              rtx base = NULL_RTX;
+              rtx dest = NULL_RTX;
+              rtx offset = NULL_RTX;
+              bool pre_or_post = false;
+
+              if (!INSN_P (insn)
+                  || DEBUG_INSN_P (insn))
+                continue;
+
+              if (get_load_info (insn, loop, &pre_or_post,
+                                 &base, &dest, &offset)
+                  && REG_P (dest))
+                {
+                  int int_offset = 0;
+                  if (offset && REG_P (offset))
+                    int_offset = (1 << 5) | REGNO (offset);
+                  else if (offset && CONST_INT_P (offset))
+                    {
+                      int_offset = INTVAL (offset);
+                      int_offset
+                        /= GET_MODE_SIZE (GET_MODE (dest)).to_constant ();
+                      if (!pre_or_post)
+                        int_offset >>= 2;
+                    }
+                  tag = make_tag (REGNO (dest), REGNO (base), int_offset);
+                  rtx t = GEN_INT (tag);
+                  if (!tag_map.get (t))
+                    tags.safe_push (t);
+                  tag_map.get_or_insert (t).safe_push (insn);
+                }
+            }
+        }
+
+      for (unsigned i = 0; i < tags.length (); ++i)
+        {
+          rtx t = tags[i];
+          auto_vec<rtx_insn *> *v = tag_map.get (t);
+
+          for (int j = v->length () - 1; j > 0; --j)
+            {
+              /* Get the insns whose tags collide.  */
+              rtx_insn *insn = (*v)[j];
+              rtx pat;
+              bool changed = false;
+              int int_offset = 0;
+              rtx base = NULL_RTX;
+              rtx dest = NULL_RTX;
+              rtx offset = NULL_RTX;
+              bool pre_or_post = false;
+
+              if (!get_load_info (insn, loop, &pre_or_post,
+                                  &base, &dest, &offset))
+                gcc_assert (false);
+
+              if (offset && REG_P (offset))
+                int_offset = (1 << 5) | REGNO (offset);
+              else if (offset && CONST_INT_P (offset))
+                {
+                  int_offset = INTVAL (offset);
+                  int_offset
+                    /= GET_MODE_SIZE (GET_MODE (dest)).to_constant ();
+                  if (!pre_or_post)
+                    int_offset >>= 2;
+                }
+
+              /* Go over the temporary registers and find a free
+                 register, if available.  */
+              for (int k = R9_REGNUM; !changed && (k <= R15_REGNUM); k++)
+                if (!df_hard_reg_used_p (k))
+                  {
+                    unsigned tag;
+                    rtx t;
+
+                    tag = make_tag (REGNO (dest), k, int_offset);
+                    t = GEN_INT (tag);
+                    /* Check to see if the new tag also collides with an
+                       existing load.  */
+                    if (tag_map.get (t))
+                      continue;
+
+                    machine_mode mode = GET_MODE (base);
+                    rtx new_reg = gen_rtx_REG (mode, k);
+                    t = GEN_INT (make_tag (REGNO (dest), REGNO (new_reg),
+                                           int_offset));
+                    vec <rtx_insn *> *v2 = tag_map.get (t);
+                    if (v2 && (v2->length () > 0))
+                      continue;
+
+                    /* Change the insn: dest = load (base, offset)
+                       into tmp = base; dest = load (tmp, offset).  */
+                    extract_insn (insn);
+                    for (int l = 0;
+                         (!changed) && (l < recog_data.n_operands); l++)
+                      {
+                        subrtx_ptr_iterator::array_type array;
+                        rtx *op = recog_data.operand_loc[l];
+
+                        if (recog_data.operand_type[l] == OP_OUT)
+                          continue;
+
+                        FOR_EACH_SUBRTX_PTR (iter, array, op, NONCONST)
+                          {
+                            rtx *loc = *iter;
+                            rtx x = *loc;
+
+                            if (!changed && (base == x))
+                              {
+                                pat = gen_rtx_SET (new_reg, base);
+                                if (validate_change (insn, loc, new_reg,
+                                                     false))
+                                  {
+                                    emit_insn_before (pat, insn);
+                                    if (pre_or_post)
+                                      {
+                                        rtx pat2
+                                          = gen_rtx_SET (base, new_reg);
+                                        emit_insn_after (pat2, insn);
+                                      }
+                                  }
+                                v->pop ();
+                                tag_map.get_or_insert (t).safe_push (insn);
+                                changed = true;
+                                break;
+                              }
+                          }
+                      }
+                  }
+            }
+        }
+    }
+
+  loop_optimizer_finalize ();
+}
+
+const pass_data pass_data_tag_collision_avoidance =
+{
+  RTL_PASS,			/* type */
+  "tag_collision_avoidance",	/* name */
+  OPTGROUP_NONE,		/* optinfo_flags */
+  TV_NONE,			/* tv_id */
+  0,				/* properties_required */
+  0,				/* properties_provided */
+  0,				/* properties_destroyed */
+  0,				/* todo_flags_start */
+  TODO_df_finish,		/* todo_flags_finish */
+};
+
+class pass_tag_collision_avoidance : public rtl_opt_pass
+{
+public:
+  pass_tag_collision_avoidance (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_tag_collision_avoidance, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return (aarch64_tune_params.extra_tuning_flags
+              & AARCH64_EXTRA_TUNE_AVOID_PREFETCH_TAG_COLLISION)
+             && optimize >= 2;
+    }
+
+  virtual unsigned int execute (function *)
+    {
+      execute_tag_collision_avoidance ();
+      return 0;
+    }
+
+}; // class pass_tag_collision_avoidance
+
+/* Create a new pass instance.  */
+
+rtl_opt_pass *
+make_pass_tag_collision_avoidance (gcc::context *ctxt)
+{
+  return new pass_tag_collision_avoidance (ctxt);
+}
diff --git a/gcc/config/aarch64/t-aarch64 b/gcc/config/aarch64/t-aarch64
index 0be1f0d..f185b40 100644
--- a/gcc/config/aarch64/t-aarch64
+++ b/gcc/config/aarch64/t-aarch64
@@ -67,6 +67,15 @@ cortex-a57-fma-steering.o: $(srcdir)/config/aarch64/cortex-a57-fma-steering.c \
 	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
 		$(srcdir)/config/aarch64/cortex-a57-fma-steering.c
 
+falkor-tag-collision-avoidance.o: $(srcdir)/config/aarch64/falkor-tag-collision-avoidance.c \
+    $(CONFIG_H) $(SYSTEM_H) $(TM_H) $(REGS_H) insn-config.h $(RTL_BASE_H) \
+    dominance.h cfg.h cfganal.h $(BASIC_BLOCK_H) $(INSN_ATTR_H) $(RECOG_H) \
+    output.h hash-map.h $(DF_H) $(OBSTACK_H) $(TARGET_H) $(RTL_H) \
+    $(CONTEXT_H) $(TREE_PASS_H) regrename.h \
+    $(srcdir)/config/aarch64/aarch64-protos.h
+	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
+		$(srcdir)/config/aarch64/falkor-tag-collision-avoidance.c
+
 comma=,
 MULTILIB_OPTIONS    = $(subst $(comma),/, $(patsubst %, mabi=%, $(subst $(comma),$(comma)mabi=,$(TM_MULTILIB_CONFIG))))
 MULTILIB_DIRNAMES   = $(subst $(comma), ,$(TM_MULTILIB_CONFIG))