On Fri, Nov 17, 2017 at 5:53 PM, Richard Sandiford
<richard.sandif...@linaro.org> wrote:
> This patch adds support for in-order floating-point addition reductions,
> which are suitable even in strict IEEE mode.
>
> Previously vect_is_simple_reduction would reject any cases that forbid
> reassociation. The idea is instead to tentatively accept them as
> "FOLD_LEFT_REDUCTIONs" and only fail later if there is no target
> support for them. Although this patch only handles the particular
> case of plus and minus on floating-point types, there's no reason in
> principle why targets couldn't handle other cases.
>
> The vect_force_simple_reduction change makes it simpler for parloops
> to read the type of reduction.
>
> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
> and powerpc64le-linux-gnu. OK to install?
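For concreteness, the kind of reduction being discussed is a plain
in-order FP sum such as the one below (an illustrative example only,
not taken from the patch; the names sum, a and n are placeholders):

  double
  sum (double *a, int n)
  {
    double res = 0.0;
    for (int i = 0; i < n; ++i)
      /* Without -fassociative-math/-ffast-math these additions may not
         be reassociated, so vect_is_simple_reduction previously rejected
         such loops.  */
      res += a[i];
    return res;
  }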
I don't like that you add a new tree code for this. A new IFN looks
more suitable to me.

Also, I think that if there's a way to handle this correctly with
target support, you can also implement a fallback for targets without
such support, which would increase test coverage. It would basically
boil down to extracting all scalars from the non-reduction operand
vector and performing a series of reduction ops, keeping the reduction
PHI scalar. This would also support any reduction operator.

Thanks,
Richard.

> Richard
>
>
> 2017-11-17  Richard Sandiford  <richard.sandif...@linaro.org>
> 	    Alan Hayward  <alan.hayw...@arm.com>
> 	    David Sherwood  <david.sherw...@arm.com>
>
> gcc/
> 	* tree.def (FOLD_LEFT_PLUS_EXPR): New tree code.
> 	* doc/generic.texi (FOLD_LEFT_PLUS_EXPR): Document.
> 	* optabs.def (fold_left_plus_optab): New optab.
> 	* doc/md.texi (fold_left_plus_@var{m}): Document.
> 	* doc/sourcebuild.texi (vect_fold_left_plus): Document.
> 	* cfgexpand.c (expand_debug_expr): Handle FOLD_LEFT_PLUS_EXPR.
> 	* expr.c (expand_expr_real_2): Likewise.
> 	* fold-const.c (const_binop): Likewise.
> 	* optabs-tree.c (optab_for_tree_code): Likewise.
> 	* tree-cfg.c (verify_gimple_assign_binary): Likewise.
> 	* tree-inline.c (estimate_operator_cost): Likewise.
> 	* tree-pretty-print.c (dump_generic_node): Likewise.
> 	(op_code_prio): Likewise.
> 	(op_symbol_code): Likewise.
> 	* tree-vect-stmts.c (vectorizable_operation): Likewise.
> 	* tree-parloops.c (valid_reduction_p): New function.
> 	(gather_scalar_reductions): Use it.
> 	* tree-vectorizer.h (FOLD_LEFT_REDUCTION): New vect_reduction_type.
> 	(vect_finish_replace_stmt): Declare.
> 	* tree-vect-loop.c (fold_left_reduction_code): New function.
> 	(needs_fold_left_reduction_p): New function, split out from...
> 	(vect_is_simple_reduction): ...here. Accept reductions that
> 	forbid reassociation, but give them type FOLD_LEFT_REDUCTION.
> 	(vect_force_simple_reduction): Also store the reduction type in
> 	the assignment's STMT_VINFO_REDUC_TYPE.
> 	(vect_model_reduction_cost): Handle FOLD_LEFT_REDUCTION.
> 	(merge_with_identity): New function.
> 	(vectorize_fold_left_reduction): Likewise.
> 	(vectorizable_reduction): Handle FOLD_LEFT_REDUCTION. Leave the
> 	scalar phi in place for it. Require target support and reject
> 	cases that would reassociate the operation. Defer the transform
> 	phase to vectorize_fold_left_reduction.
> 	* config/aarch64/aarch64.md (UNSPEC_FADDA): New unspec.
> 	* config/aarch64/aarch64-sve.md (fold_left_plus_<mode>): New expander.
> 	(*fold_left_plus_<mode>, *pred_fold_left_plus_<mode>): New insns.
>
> gcc/testsuite/
> 	* lib/target-supports.exp
> 	(check_effective_target_vect_fold_left_plus): New proc.
> 	* gcc.dg/vect/no-fast-math-vect16.c: Expect the test to pass if
> 	vect_fold_left_plus.
> 	* gcc.dg/vect/pr79920.c: Expect both loops to be vectorized if
> 	vect_fold_left_plus.
> 	* gcc.dg/vect/trapv-vect-reduc-4.c: Expect the first loop to be
> 	recognized as a reduction and then rejected for lack of target
> 	support.
> 	* gcc.dg/vect/vect-reduc-6.c: Expect the loop to be vectorized if
> 	vect_fold_left_plus.
> 	* gcc.target/aarch64/sve_reduc_strict_1.c: New test.
> 	* gcc.target/aarch64/sve_reduc_strict_1_run.c: Likewise.
> 	* gcc.target/aarch64/sve_reduc_strict_2.c: Likewise.
> 	* gcc.target/aarch64/sve_reduc_strict_2_run.c: Likewise.
> 	* gcc.target/aarch64/sve_reduc_strict_3.c: Likewise.
> 	* gcc.target/aarch64/sve_slp_13.c: Add floating-point types.
> 	* gfortran.dg/vect/vect-8.f90: Expect 25 loops to be vectorized if
> 	vect_fold_left_plus.
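To make the suggested fallback concrete: on a target without an
in-order reduction instruction, each vector of the non-reduction
operand could still be reduced by a plain series of scalar operations.
For a 4-element vector it would amount to something like the sketch
below (res and vec are illustrative names, not the actual GIMPLE the
vectorizer would emit):

  /* The reduction PHI result res stays scalar; one vector iteration
     becomes a chain of element extracts and scalar reduction ops.  */
  res = res + vec[0];
  res = res + vec[1];
  res = res + vec[2];
  res = res + vec[3];

The same scheme would work for any reduction operator, which is what
makes the fallback useful for test coverage.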
> > Index: gcc/tree.def > =================================================================== > --- gcc/tree.def 2017-11-17 16:52:07.246852461 +0000 > +++ gcc/tree.def 2017-11-17 16:52:07.631930981 +0000 > @@ -1302,6 +1302,8 @@ DEFTREECODE (REDUC_AND_EXPR, "reduc_and_ > DEFTREECODE (REDUC_IOR_EXPR, "reduc_ior_expr", tcc_unary, 1) > DEFTREECODE (REDUC_XOR_EXPR, "reduc_xor_expr", tcc_unary, 1) > > +DEFTREECODE (FOLD_LEFT_PLUS_EXPR, "fold_left_plus_expr", tcc_binary, 2) > + > /* Widening dot-product. > The first two arguments are of type t1. > The third argument and the result are of type t2, such that t2 is at least > Index: gcc/doc/generic.texi > =================================================================== > --- gcc/doc/generic.texi 2017-11-17 16:52:07.246852461 +0000 > +++ gcc/doc/generic.texi 2017-11-17 16:52:07.620954871 +0000 > @@ -1746,6 +1746,7 @@ a value from @code{enum annot_expr_kind} > @tindex REDUC_AND_EXPR > @tindex REDUC_IOR_EXPR > @tindex REDUC_XOR_EXPR > +@tindex FOLD_LEFT_PLUS_EXPR > > @table @code > @item VEC_DUPLICATE_EXPR > @@ -1861,6 +1862,12 @@ the maximum element in @var{x}. The ass > is unspecified; for example, @samp{REDUC_PLUS_EXPR <@var{x}>} could > sum floating-point @var{x} in forward order, in reverse order, > using a tree, or in some other way. > + > +@item FOLD_LEFT_PLUS_EXPR > +This node takes two arguments: a scalar of type @var{t} and a vector > +of @var{t}s. It successively adds each element of the vector to the > +scalar and returns the result. The operation is strictly in-order: > +there is no reassociation. > @end table > > > Index: gcc/optabs.def > =================================================================== > --- gcc/optabs.def 2017-11-17 16:52:07.246852461 +0000 > +++ gcc/optabs.def 2017-11-17 16:52:07.625528250 +0000 > @@ -306,6 +306,7 @@ OPTAB_D (reduc_umin_scal_optab, "reduc_u > OPTAB_D (reduc_and_scal_optab, "reduc_and_scal_$a") > OPTAB_D (reduc_ior_scal_optab, "reduc_ior_scal_$a") > OPTAB_D (reduc_xor_scal_optab, "reduc_xor_scal_$a") > +OPTAB_D (fold_left_plus_optab, "fold_left_plus_$a") > > OPTAB_D (extract_last_optab, "extract_last_$a") > OPTAB_D (fold_extract_last_optab, "fold_extract_last_$a") > Index: gcc/doc/md.texi > =================================================================== > --- gcc/doc/md.texi 2017-11-17 16:52:07.246852461 +0000 > +++ gcc/doc/md.texi 2017-11-17 16:52:07.621869547 +0000 > @@ -5285,6 +5285,14 @@ has mode @var{m} and operands 0 and 1 ha > one element of @var{m}. Operand 2 has the usual mask mode for vectors > of mode @var{m}; see @code{TARGET_VECTORIZE_GET_MASK_MODE}. > > +@cindex @code{fold_left_plus_@var{m}} instruction pattern > +@item @code{fold_left_plus_@var{m}} > +Take scalar operand 1 and successively add each element from vector > +operand 2. Store the result in scalar operand 0. The vector has > +mode @var{m} and the scalars have the mode appropriate for one > +element of @var{m}. The operation is strictly in-order: there is > +no reassociation. > + > @cindex @code{sdot_prod@var{m}} instruction pattern > @item @samp{sdot_prod@var{m}} > @cindex @code{udot_prod@var{m}} instruction pattern > Index: gcc/doc/sourcebuild.texi > =================================================================== > --- gcc/doc/sourcebuild.texi 2017-11-17 16:52:07.246852461 +0000 > +++ gcc/doc/sourcebuild.texi 2017-11-17 16:52:07.621869547 +0000 > @@ -1580,6 +1580,9 @@ Target supports AND, IOR and XOR reducti > > @item vect_fold_extract_last > Target supports the @code{fold_extract_last} optab. 
> + > +@item vect_fold_left_plus > +Target supports the @code{fold_left_plus} optab. > @end table > > @subsubsection Thread Local Storage attributes > Index: gcc/cfgexpand.c > =================================================================== > --- gcc/cfgexpand.c 2017-11-17 16:52:07.246852461 +0000 > +++ gcc/cfgexpand.c 2017-11-17 16:52:07.620040195 +0000 > @@ -5072,6 +5072,7 @@ expand_debug_expr (tree exp) > case REDUC_AND_EXPR: > case REDUC_IOR_EXPR: > case REDUC_XOR_EXPR: > + case FOLD_LEFT_PLUS_EXPR: > case VEC_COND_EXPR: > case VEC_PACK_FIX_TRUNC_EXPR: > case VEC_PACK_SAT_EXPR: > Index: gcc/expr.c > =================================================================== > --- gcc/expr.c 2017-11-17 16:52:07.246852461 +0000 > +++ gcc/expr.c 2017-11-17 16:52:07.622784222 +0000 > @@ -9438,6 +9438,28 @@ #define REDUCE_BIT_FIELD(expr) (reduce_b > return target; > } > > + case FOLD_LEFT_PLUS_EXPR: > + { > + op0 = expand_normal (treeop0); > + op1 = expand_normal (treeop1); > + this_optab = optab_for_tree_code (code, type, optab_default); > + machine_mode vec_mode = TYPE_MODE (TREE_TYPE (treeop1)); > + insn_code icode = optab_handler (this_optab, vec_mode); > + > + if (icode != CODE_FOR_nothing) > + { > + struct expand_operand ops[3]; > + create_output_operand (&ops[0], target, mode); > + create_input_operand (&ops[1], op0, mode); > + create_input_operand (&ops[2], op1, vec_mode); > + if (maybe_expand_insn (icode, 3, ops)) > + return ops[0].value; > + } > + > + /* Nothing to fall back to. */ > + gcc_unreachable (); > + } > + > case REDUC_MAX_EXPR: > case REDUC_MIN_EXPR: > case REDUC_PLUS_EXPR: > Index: gcc/fold-const.c > =================================================================== > --- gcc/fold-const.c 2017-11-17 16:52:07.246852461 +0000 > +++ gcc/fold-const.c 2017-11-17 16:52:07.623698898 +0000 > @@ -1603,6 +1603,32 @@ const_binop (enum tree_code code, tree a > return NULL_TREE; > return build_vector_from_val (TREE_TYPE (arg1), sub); > } > + > + if (CONSTANT_CLASS_P (arg1) > + && TREE_CODE (arg2) == VECTOR_CST) > + { > + tree_code subcode; > + > + switch (code) > + { > + case FOLD_LEFT_PLUS_EXPR: > + subcode = PLUS_EXPR; > + break; > + default: > + return NULL_TREE; > + } > + > + int nelts = VECTOR_CST_NELTS (arg2); > + tree accum = arg1; > + for (int i = 0; i < nelts; i++) > + { > + accum = const_binop (subcode, accum, VECTOR_CST_ELT (arg2, i)); > + if (accum == NULL_TREE || !CONSTANT_CLASS_P (accum)) > + return NULL_TREE; > + } > + > + return accum; > + } > return NULL_TREE; > } > > Index: gcc/optabs-tree.c > =================================================================== > --- gcc/optabs-tree.c 2017-11-17 16:52:07.246852461 +0000 > +++ gcc/optabs-tree.c 2017-11-17 16:52:07.623698898 +0000 > @@ -166,6 +166,9 @@ optab_for_tree_code (enum tree_code code > case REDUC_XOR_EXPR: > return reduc_xor_scal_optab; > > + case FOLD_LEFT_PLUS_EXPR: > + return fold_left_plus_optab; > + > case VEC_WIDEN_MULT_HI_EXPR: > return TYPE_UNSIGNED (type) ? > vec_widen_umult_hi_optab : vec_widen_smult_hi_optab; > Index: gcc/tree-cfg.c > =================================================================== > --- gcc/tree-cfg.c 2017-11-17 16:52:07.246852461 +0000 > +++ gcc/tree-cfg.c 2017-11-17 16:52:07.628272277 +0000 > @@ -4116,6 +4116,19 @@ verify_gimple_assign_binary (gassign *st > /* Continue with generic binary expression handling. 
*/ > break; > > + case FOLD_LEFT_PLUS_EXPR: > + if (!VECTOR_TYPE_P (rhs2_type) > + || !useless_type_conversion_p (lhs_type, TREE_TYPE (rhs2_type)) > + || !useless_type_conversion_p (lhs_type, rhs1_type)) > + { > + error ("reduction should convert from vector to element type"); > + debug_generic_expr (lhs_type); > + debug_generic_expr (rhs1_type); > + debug_generic_expr (rhs2_type); > + return true; > + } > + return false; > + > case VEC_SERIES_EXPR: > if (!useless_type_conversion_p (rhs1_type, rhs2_type)) > { > Index: gcc/tree-inline.c > =================================================================== > --- gcc/tree-inline.c 2017-11-17 16:52:07.246852461 +0000 > +++ gcc/tree-inline.c 2017-11-17 16:52:07.628272277 +0000 > @@ -3881,6 +3881,7 @@ estimate_operator_cost (enum tree_code c > case REDUC_AND_EXPR: > case REDUC_IOR_EXPR: > case REDUC_XOR_EXPR: > + case FOLD_LEFT_PLUS_EXPR: > case WIDEN_SUM_EXPR: > case WIDEN_MULT_EXPR: > case DOT_PROD_EXPR: > Index: gcc/tree-pretty-print.c > =================================================================== > --- gcc/tree-pretty-print.c 2017-11-17 16:52:07.246852461 +0000 > +++ gcc/tree-pretty-print.c 2017-11-17 16:52:07.629186953 +0000 > @@ -3232,6 +3232,7 @@ dump_generic_node (pretty_printer *pp, t > break; > > case VEC_SERIES_EXPR: > + case FOLD_LEFT_PLUS_EXPR: > case VEC_WIDEN_MULT_HI_EXPR: > case VEC_WIDEN_MULT_LO_EXPR: > case VEC_WIDEN_MULT_EVEN_EXPR: > @@ -3628,6 +3629,7 @@ op_code_prio (enum tree_code code) > case REDUC_MAX_EXPR: > case REDUC_MIN_EXPR: > case REDUC_PLUS_EXPR: > + case FOLD_LEFT_PLUS_EXPR: > case VEC_UNPACK_HI_EXPR: > case VEC_UNPACK_LO_EXPR: > case VEC_UNPACK_FLOAT_HI_EXPR: > @@ -3749,6 +3751,9 @@ op_symbol_code (enum tree_code code) > case REDUC_PLUS_EXPR: > return "r+"; > > + case FOLD_LEFT_PLUS_EXPR: > + return "fl+"; > + > case WIDEN_SUM_EXPR: > return "w+"; > > Index: gcc/tree-vect-stmts.c > =================================================================== > --- gcc/tree-vect-stmts.c 2017-11-17 16:52:07.246852461 +0000 > +++ gcc/tree-vect-stmts.c 2017-11-17 16:52:07.631016305 +0000 > @@ -5415,6 +5415,10 @@ vectorizable_operation (gimple *stmt, gi > > code = gimple_assign_rhs_code (stmt); > > + /* Ignore operations that mix scalar and vector input operands. */ > + if (code == FOLD_LEFT_PLUS_EXPR) > + return false; > + > /* For pointer addition, we should use the normal plus for > the vector addition. */ > if (code == POINTER_PLUS_EXPR) > Index: gcc/tree-parloops.c > =================================================================== > --- gcc/tree-parloops.c 2017-11-17 16:52:07.246852461 +0000 > +++ gcc/tree-parloops.c 2017-11-17 16:52:07.629186953 +0000 > @@ -2531,6 +2531,19 @@ set_reduc_phi_uids (reduction_info **slo > return 1; > } > > +/* Return true if the type of reduction performed by STMT is suitable > + for this pass. */ > + > +static bool > +valid_reduction_p (gimple *stmt) > +{ > + /* Parallelization would reassociate the operation, which isn't > + allowed for in-order reductions. */ > + stmt_vec_info stmt_info = vinfo_for_stmt (stmt); > + vect_reduction_type reduc_type = STMT_VINFO_REDUC_TYPE (stmt_info); > + return reduc_type != FOLD_LEFT_REDUCTION; > +} > + > /* Detect all reductions in the LOOP, insert them into REDUCTION_LIST. 
*/ > > static void > @@ -2564,7 +2577,7 @@ gather_scalar_reductions (loop_p loop, r > gimple *reduc_stmt > = vect_force_simple_reduction (simple_loop_info, phi, > &double_reduc, true); > - if (!reduc_stmt) > + if (!reduc_stmt || !valid_reduction_p (reduc_stmt)) > continue; > > if (double_reduc) > @@ -2610,7 +2623,8 @@ gather_scalar_reductions (loop_p loop, r > = vect_force_simple_reduction (simple_loop_info, inner_phi, > &double_reduc, true); > gcc_assert (!double_reduc); > - if (inner_reduc_stmt == NULL) > + if (inner_reduc_stmt == NULL > + || !valid_reduction_p (inner_reduc_stmt)) > continue; > > build_new_reduction (reduction_list, double_reduc_stmts[i], > phi); > Index: gcc/tree-vectorizer.h > =================================================================== > --- gcc/tree-vectorizer.h 2017-11-17 16:52:07.246852461 +0000 > +++ gcc/tree-vectorizer.h 2017-11-17 16:52:07.631016305 +0000 > @@ -74,7 +74,15 @@ enum vect_reduction_type { > > for (int i = 0; i < VF; ++i) > res = cond[i] ? val[i] : res; */ > - EXTRACT_LAST_REDUCTION > + EXTRACT_LAST_REDUCTION, > + > + /* Use a folding reduction within the loop to implement: > + > + for (int i = 0; i < VF; ++i) > + res = res OP val[i]; > + > + (with no reassocation). */ > + FOLD_LEFT_REDUCTION > }; > > #define VECTORIZABLE_CYCLE_DEF(D) (((D) == vect_reduction_def) \ > @@ -1389,6 +1397,7 @@ extern void vect_model_load_cost (stmt_v > extern unsigned record_stmt_cost (stmt_vector_for_cost *, int, > enum vect_cost_for_stmt, stmt_vec_info, > int, enum vect_cost_model_location); > +extern void vect_finish_replace_stmt (gimple *, gimple *); > extern void vect_finish_stmt_generation (gimple *, gimple *, > gimple_stmt_iterator *); > extern bool vect_mark_stmts_to_be_vectorized (loop_vec_info); > Index: gcc/tree-vect-loop.c > =================================================================== > --- gcc/tree-vect-loop.c 2017-11-17 16:52:07.246852461 +0000 > +++ gcc/tree-vect-loop.c 2017-11-17 16:52:07.630101629 +0000 > @@ -2573,6 +2573,29 @@ vect_analyze_loop (struct loop *loop, lo > } > } > > +/* Return true if the target supports in-order reductions for operation > + CODE and type TYPE. If the target supports it, store the reduction > + operation in *REDUC_CODE. */ > + > +static bool > +fold_left_reduction_code (tree_code code, tree type, tree_code *reduc_code) > +{ > + switch (code) > + { > + case PLUS_EXPR: > + code = FOLD_LEFT_PLUS_EXPR; > + break; > + > + default: > + return false; > + } > + > + if (!target_supports_op_p (type, code, optab_vector)) > + return false; > + > + *reduc_code = code; > + return true; > +} > > /* Function reduction_code_for_scalar_code > > @@ -2880,6 +2903,42 @@ vect_is_slp_reduction (loop_vec_info loo > return true; > } > > +/* Returns true if we need an in-order reduction for operation CODE > + on type TYPE. NEED_WRAPPING_INTEGRAL_OVERFLOW is true if integer > + overflow must wrap. */ > + > +static bool > +needs_fold_left_reduction_p (tree type, tree_code code, > + bool need_wrapping_integral_overflow) > +{ > + /* CHECKME: check for !flag_finite_math_only too? 
*/ > + if (SCALAR_FLOAT_TYPE_P (type)) > + switch (code) > + { > + case MIN_EXPR: > + case MAX_EXPR: > + return false; > + > + default: > + return !flag_associative_math; > + } > + > + if (INTEGRAL_TYPE_P (type)) > + { > + if (!operation_no_trapping_overflow (type, code)) > + return true; > + if (need_wrapping_integral_overflow > + && !TYPE_OVERFLOW_WRAPS (type) > + && operation_can_overflow (code)) > + return true; > + return false; > + } > + > + if (SAT_FIXED_POINT_TYPE_P (type)) > + return true; > + > + return false; > +} > > /* Function vect_is_simple_reduction > > @@ -3198,58 +3257,18 @@ vect_is_simple_reduction (loop_vec_info > return NULL; > } > > - /* Check that it's ok to change the order of the computation. > + /* Check whether it's ok to change the order of the computation. > Generally, when vectorizing a reduction we change the order of the > computation. This may change the behavior of the program in some > cases, so we need to check that this is ok. One exception is when > vectorizing an outer-loop: the inner-loop is executed sequentially, > and therefore vectorizing reductions in the inner-loop during > outer-loop vectorization is safe. */ > - > - if (*v_reduc_type != COND_REDUCTION > - && check_reduction) > - { > - /* CHECKME: check for !flag_finite_math_only too? */ > - if (SCALAR_FLOAT_TYPE_P (type) && !flag_associative_math) > - { > - /* Changing the order of operations changes the semantics. */ > - if (dump_enabled_p ()) > - report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt, > - "reduction: unsafe fp math optimization: "); > - return NULL; > - } > - else if (INTEGRAL_TYPE_P (type)) > - { > - if (!operation_no_trapping_overflow (type, code)) > - { > - /* Changing the order of operations changes the semantics. */ > - if (dump_enabled_p ()) > - report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt, > - "reduction: unsafe int math optimization" > - " (overflow traps): "); > - return NULL; > - } > - if (need_wrapping_integral_overflow > - && !TYPE_OVERFLOW_WRAPS (type) > - && operation_can_overflow (code)) > - { > - /* Changing the order of operations changes the semantics. */ > - if (dump_enabled_p ()) > - report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt, > - "reduction: unsafe int math optimization" > - " (overflow doesn't wrap): "); > - return NULL; > - } > - } > - else if (SAT_FIXED_POINT_TYPE_P (type)) > - { > - /* Changing the order of operations changes the semantics. */ > - if (dump_enabled_p ()) > - report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt, > - "reduction: unsafe fixed-point math optimization: > "); > - return NULL; > - } > - } > + if (check_reduction > + && *v_reduc_type == TREE_CODE_REDUCTION > + && needs_fold_left_reduction_p (type, code, > + need_wrapping_integral_overflow)) > + *v_reduc_type = FOLD_LEFT_REDUCTION; > > /* Reduction is safe. We're dealing with one of the following: > 1) integer arithmetic and no trapv > @@ -3513,6 +3532,7 @@ vect_force_simple_reduction (loop_vec_in > STMT_VINFO_REDUC_TYPE (reduc_def_info) = v_reduc_type; > STMT_VINFO_REDUC_DEF (reduc_def_info) = def; > reduc_def_info = vinfo_for_stmt (def); > + STMT_VINFO_REDUC_TYPE (reduc_def_info) = v_reduc_type; > STMT_VINFO_REDUC_DEF (reduc_def_info) = phi; > } > return def; > @@ -4065,7 +4085,8 @@ vect_model_reduction_cost (stmt_vec_info > > code = gimple_assign_rhs_code (orig_stmt); > > - if (reduction_type == EXTRACT_LAST_REDUCTION) > + if (reduction_type == EXTRACT_LAST_REDUCTION > + || reduction_type == FOLD_LEFT_REDUCTION) > { > /* No extra instructions needed in the prologue. 
*/ > prologue_cost = 0; > @@ -4138,7 +4159,8 @@ vect_model_reduction_cost (stmt_vec_info > scalar_stmt, stmt_info, 0, > vect_epilogue); > } > - else if (reduction_type == EXTRACT_LAST_REDUCTION) > + else if (reduction_type == EXTRACT_LAST_REDUCTION > + || reduction_type == FOLD_LEFT_REDUCTION) > /* No extra instructions need in the epilogue. */ > ; > else > @@ -5884,6 +5906,155 @@ vect_create_epilog_for_reduction (vec<tr > } > } > > +/* Return a vector of type VECTYPE that is equal to the vector select > + operation "MASK ? VEC : IDENTITY". Insert the select statements > + before GSI. */ > + > +static tree > +merge_with_identity (gimple_stmt_iterator *gsi, tree mask, tree vectype, > + tree vec, tree identity) > +{ > + tree cond = make_temp_ssa_name (vectype, NULL, "cond"); > + gimple *new_stmt = gimple_build_assign (cond, VEC_COND_EXPR, > + mask, vec, identity); > + gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT); > + return cond; > +} > + > +/* Perform an in-order reduction (FOLD_LEFT_REDUCTION). STMT is the > + statement that sets the live-out value. REDUC_DEF_STMT is the phi > + statement. CODE is the operation performed by STMT and OPS are > + its scalar operands. REDUC_INDEX is the index of the operand in > + OPS that is set by REDUC_DEF_STMT. REDUC_CODE is the code that > + implements in-order reduction and VECTYPE_IN is the type of its > + vector input. MASKS specifies the masks that should be used to > + control the operation in a fully-masked loop. */ > + > +static bool > +vectorize_fold_left_reduction (gimple *stmt, gimple_stmt_iterator *gsi, > + gimple **vec_stmt, slp_tree slp_node, > + gimple *reduc_def_stmt, > + tree_code code, tree_code reduc_code, > + tree ops[3], tree vectype_in, > + int reduc_index, vec_loop_masks *masks) > +{ > + stmt_vec_info stmt_info = vinfo_for_stmt (stmt); > + loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info); > + struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); > + tree vectype_out = STMT_VINFO_VECTYPE (stmt_info); > + gimple *new_stmt = NULL; > + > + int ncopies; > + if (slp_node) > + ncopies = 1; > + else > + ncopies = vect_get_num_copies (loop_vinfo, vectype_in); > + > + gcc_assert (!nested_in_vect_loop_p (loop, stmt)); > + gcc_assert (ncopies == 1); > + gcc_assert (TREE_CODE_LENGTH (code) == binary_op); > + gcc_assert (reduc_index == (code == MINUS_EXPR ? 
0 : 1)); > + gcc_assert (STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info) > + == FOLD_LEFT_REDUCTION); > + > + if (slp_node) > + gcc_assert (must_eq (TYPE_VECTOR_SUBPARTS (vectype_out), > + TYPE_VECTOR_SUBPARTS (vectype_in))); > + > + tree op0 = ops[1 - reduc_index]; > + > + int group_size = 1; > + gimple *scalar_dest_def; > + auto_vec<tree> vec_oprnds0; > + if (slp_node) > + { > + vect_get_vec_defs (op0, NULL_TREE, stmt, &vec_oprnds0, NULL, slp_node); > + group_size = SLP_TREE_SCALAR_STMTS (slp_node).length (); > + scalar_dest_def = SLP_TREE_SCALAR_STMTS (slp_node)[group_size - 1]; > + } > + else > + { > + tree loop_vec_def0 = vect_get_vec_def_for_operand (op0, stmt); > + vec_oprnds0.create (1); > + vec_oprnds0.quick_push (loop_vec_def0); > + scalar_dest_def = stmt; > + } > + > + tree scalar_dest = gimple_assign_lhs (scalar_dest_def); > + tree scalar_type = TREE_TYPE (scalar_dest); > + tree reduc_var = gimple_phi_result (reduc_def_stmt); > + > + int vec_num = vec_oprnds0.length (); > + gcc_assert (vec_num == 1 || slp_node); > + tree vec_elem_type = TREE_TYPE (vectype_out); > + gcc_checking_assert (useless_type_conversion_p (scalar_type, > vec_elem_type)); > + > + tree vector_identity = NULL_TREE; > + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) > + vector_identity = build_zero_cst (vectype_out); > + > + int i; > + tree def0; > + FOR_EACH_VEC_ELT (vec_oprnds0, i, def0) > + { > + tree mask = NULL_TREE; > + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) > + mask = vect_get_loop_mask (gsi, masks, vec_num, vectype_in, i); > + > + /* Handle MINUS by adding the negative. */ > + if (code == MINUS_EXPR) > + { > + tree negated = make_ssa_name (vectype_out); > + new_stmt = gimple_build_assign (negated, NEGATE_EXPR, def0); > + gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT); > + def0 = negated; > + } > + > + if (mask) > + def0 = merge_with_identity (gsi, mask, vectype_out, def0, > + vector_identity); > + > + /* On the first iteration the input is simply the scalar phi > + result, and for subsequent iterations it is the output of > + the preceding operation. */ > + tree expr = build2 (reduc_code, scalar_type, reduc_var, def0); > + > + /* For chained SLP reductions the output of the previous reduction > + operation serves as the input of the next. For the final statement > + the output cannot be a temporary - we reuse the original > + scalar destination of the last statement. */ > + if (i == vec_num - 1) > + reduc_var = scalar_dest; > + else > + reduc_var = vect_create_destination_var (scalar_dest, NULL); > + new_stmt = gimple_build_assign (reduc_var, expr); > + > + if (i == vec_num - 1) > + { > + SSA_NAME_DEF_STMT (reduc_var) = new_stmt; > + /* For chained SLP stmt is the first statement in the group and > + gsi points to the last statement in the group. For non SLP stmt > + points to the same location as gsi. In either case tmp_gsi and > gsi > + should both point to the same insertion point. */ > + gcc_assert (scalar_dest_def == gsi_stmt (*gsi)); > + vect_finish_replace_stmt (scalar_dest_def, new_stmt); > + } > + else > + { > + reduc_var = make_ssa_name (reduc_var, new_stmt); > + gimple_assign_set_lhs (new_stmt, reduc_var); > + vect_finish_stmt_generation (stmt, new_stmt, gsi); > + } > + > + if (slp_node) > + SLP_TREE_VEC_STMTS (slp_node).quick_push (new_stmt); > + } > + > + if (!slp_node) > + STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt; > + > + return true; > +} > > /* Function is_nonwrapping_integer_induction. 
> > @@ -6063,6 +6234,12 @@ vectorizable_reduction (gimple *stmt, gi > return true; > } > > + if (STMT_VINFO_REDUC_TYPE (stmt_info) == FOLD_LEFT_REDUCTION) > + /* Leave the scalar phi in place. Note that checking > + STMT_VINFO_VEC_REDUCTION_TYPE (as below) only works > + for reductions involving a single statement. */ > + return true; > + > gimple *reduc_stmt = STMT_VINFO_REDUC_DEF (stmt_info); > if (STMT_VINFO_IN_PATTERN_P (vinfo_for_stmt (reduc_stmt))) > reduc_stmt = STMT_VINFO_RELATED_STMT (vinfo_for_stmt (reduc_stmt)); > @@ -6289,6 +6466,14 @@ vectorizable_reduction (gimple *stmt, gi > directy used in stmt. */ > if (reduc_index == -1) > { > + if (STMT_VINFO_REDUC_TYPE (stmt_info) == FOLD_LEFT_REDUCTION) > + { > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > + "in-order reduction chain without SLP.\n"); > + return false; > + } > + > if (orig_stmt) > reduc_def_stmt = STMT_VINFO_REDUC_DEF (orig_stmt_info); > else > @@ -6508,7 +6693,9 @@ vectorizable_reduction (gimple *stmt, gi > > vect_reduction_type reduction_type > = STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info); > - if (orig_stmt && reduction_type == TREE_CODE_REDUCTION) > + if (orig_stmt > + && (reduction_type == TREE_CODE_REDUCTION > + || reduction_type == FOLD_LEFT_REDUCTION)) > { > /* This is a reduction pattern: get the vectype from the type of the > reduction variable, and get the tree-code from orig_stmt. */ > @@ -6555,13 +6742,22 @@ vectorizable_reduction (gimple *stmt, gi > epilog_reduc_code = ERROR_MARK; > > if (reduction_type == TREE_CODE_REDUCTION > + || reduction_type == FOLD_LEFT_REDUCTION > || reduction_type == INTEGER_INDUC_COND_REDUCTION > || reduction_type == CONST_COND_REDUCTION) > { > - if (reduction_code_for_scalar_code (orig_code, &epilog_reduc_code)) > + bool have_reduc_support; > + if (reduction_type == FOLD_LEFT_REDUCTION) > + have_reduc_support = fold_left_reduction_code (orig_code, vectype_out, > + &epilog_reduc_code); > + else > + have_reduc_support > + = reduction_code_for_scalar_code (orig_code, &epilog_reduc_code); > + > + if (have_reduc_support) > { > reduc_optab = optab_for_tree_code (epilog_reduc_code, vectype_out, > - optab_default); > + optab_default); > if (!reduc_optab) > { > if (dump_enabled_p ()) > @@ -6687,6 +6883,41 @@ vectorizable_reduction (gimple *stmt, gi > } > } > > + if (double_reduc && reduction_type == FOLD_LEFT_REDUCTION) > + { > + /* We can't support in-order reductions of code such as this: > + > + for (int i = 0; i < n1; ++i) > + for (int j = 0; j < n2; ++j) > + l += a[j]; > + > + since GCC effectively transforms the loop when vectorizing: > + > + for (int i = 0; i < n1 / VF; ++i) > + for (int j = 0; j < n2; ++j) > + for (int k = 0; k < VF; ++k) > + l += a[j]; > + > + which is a reassociation of the original operation. */ > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > + "in-order double reduction not supported.\n"); > + > + return false; > + } > + > + if (reduction_type == FOLD_LEFT_REDUCTION > + && slp_node > + && !GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt))) > + { > + /* We cannot in-order reductions in this case because there is > + an implicit reassociation of the operations involved. */ > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > + "in-order unchained SLP reductions not > supported.\n"); > + return false; > + } > + > /* In case of widenning multiplication by a constant, we update the type > of the constant to be the type of the other operand. 
We check that the > constant fits the type in the pattern recognition pass. */ > @@ -6807,9 +7038,10 @@ vectorizable_reduction (gimple *stmt, gi > vect_model_reduction_cost (stmt_info, epilog_reduc_code, ncopies); > if (loop_vinfo && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo)) > { > - if (cond_fn == IFN_LAST > - || !direct_internal_fn_supported_p (cond_fn, vectype_in, > - OPTIMIZE_FOR_SPEED)) > + if (reduction_type != FOLD_LEFT_REDUCTION > + && (cond_fn == IFN_LAST > + || !direct_internal_fn_supported_p (cond_fn, vectype_in, > + OPTIMIZE_FOR_SPEED))) > { > if (dump_enabled_p ()) > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > @@ -6844,6 +7076,11 @@ vectorizable_reduction (gimple *stmt, gi > > bool masked_loop_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo); > > + if (reduction_type == FOLD_LEFT_REDUCTION) > + return vectorize_fold_left_reduction > + (stmt, gsi, vec_stmt, slp_node, reduc_def_stmt, code, > + epilog_reduc_code, ops, vectype_in, reduc_index, masks); > + > if (reduction_type == EXTRACT_LAST_REDUCTION) > { > gcc_assert (!slp_node); > Index: gcc/config/aarch64/aarch64.md > =================================================================== > --- gcc/config/aarch64/aarch64.md 2017-11-17 16:52:07.246852461 +0000 > +++ gcc/config/aarch64/aarch64.md 2017-11-17 16:52:07.620954871 +0000 > @@ -164,6 +164,7 @@ (define_c_enum "unspec" [ > UNSPEC_STN > UNSPEC_INSR > UNSPEC_CLASTB > + UNSPEC_FADDA > ]) > > (define_c_enum "unspecv" [ > Index: gcc/config/aarch64/aarch64-sve.md > =================================================================== > --- gcc/config/aarch64/aarch64-sve.md 2017-11-17 16:52:07.246852461 +0000 > +++ gcc/config/aarch64/aarch64-sve.md 2017-11-17 16:52:07.620040195 +0000 > @@ -1574,6 +1574,45 @@ (define_insn "*reduc_<optab>_scal_<mode> > "<bit_reduc_op>\t%<Vetype>0, %1, %2.<Vetype>" > ) > > +;; Unpredicated in-order FP reductions. > +(define_expand "fold_left_plus_<mode>" > + [(set (match_operand:<VEL> 0 "register_operand") > + (unspec:<VEL> [(match_dup 3) > + (match_operand:<VEL> 1 "register_operand") > + (match_operand:SVE_F 2 "register_operand")] > + UNSPEC_FADDA))] > + "TARGET_SVE" > + { > + operands[3] = force_reg (<VPRED>mode, CONSTM1_RTX (<VPRED>mode)); > + } > +) > + > +;; In-order FP reductions predicated with PTRUE. > +(define_insn "*fold_left_plus_<mode>" > + [(set (match_operand:<VEL> 0 "register_operand" "=w") > + (unspec:<VEL> [(match_operand:<VPRED> 1 "register_operand" "Upl") > + (match_operand:<VEL> 2 "register_operand" "0") > + (match_operand:SVE_F 3 "register_operand" "w")] > + UNSPEC_FADDA))] > + "TARGET_SVE" > + "fadda\t%<Vetype>0, %1, %<Vetype>0, %3.<Vetype>" > +) > + > +;; Predicated form of the above in-order reduction. > +(define_insn "*pred_fold_left_plus_<mode>" > + [(set (match_operand:<VEL> 0 "register_operand" "=w") > + (unspec:<VEL> > + [(match_operand:<VEL> 1 "register_operand" "0") > + (unspec:SVE_F > + [(match_operand:<VPRED> 2 "register_operand" "Upl") > + (match_operand:SVE_F 3 "register_operand" "w") > + (match_operand:SVE_F 4 "aarch64_simd_imm_zero")] > + UNSPEC_SEL)] > + UNSPEC_FADDA))] > + "TARGET_SVE" > + "fadda\t%<Vetype>0, %2, %<Vetype>0, %3.<Vetype>" > +) > + > ;; Unpredicated floating-point addition. 
> (define_expand "add<mode>3" > [(set (match_operand:SVE_F 0 "register_operand") > Index: gcc/testsuite/lib/target-supports.exp > =================================================================== > --- gcc/testsuite/lib/target-supports.exp 2017-11-17 16:52:07.246852461 > +0000 > +++ gcc/testsuite/lib/target-supports.exp 2017-11-17 16:52:07.627357602 > +0000 > @@ -7180,6 +7180,12 @@ proc check_effective_target_vect_fold_ex > return [check_effective_target_aarch64_sve] > } > > +# Return 1 if the target supports the fold_left_plus optab. > + > +proc check_effective_target_vect_fold_left_plus { } { > + return [check_effective_target_aarch64_sve] > +} > + > # Return 1 if the target supports section-anchors > > proc check_effective_target_section_anchors { } { > Index: gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c > =================================================================== > --- gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c 2017-11-17 > 16:52:07.246852461 +0000 > +++ gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c 2017-11-17 > 16:52:07.625528250 +0000 > @@ -34,4 +34,4 @@ int main (void) > } > > /* Requires fast-math. */ > -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail > *-*-* } } } */ > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { > ! vect_fold_left_plus } } } } */ > Index: gcc/testsuite/gcc.dg/vect/pr79920.c > =================================================================== > --- gcc/testsuite/gcc.dg/vect/pr79920.c 2017-11-17 16:52:07.246852461 +0000 > +++ gcc/testsuite/gcc.dg/vect/pr79920.c 2017-11-17 16:52:07.625528250 +0000 > @@ -41,4 +41,5 @@ int main() > return 0; > } > > -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target > { vect_double && { vect_perm && vect_hw_misalign } } } } } */ > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target > { { vect_double && { ! vect_fold_left_plus } } && { vect_perm && > vect_hw_misalign } } } } } */ > +/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target > { { vect_double && vect_fold_left_plus } && { vect_perm && vect_hw_misalign } > } } } } */ > Index: gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c > =================================================================== > --- gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c 2017-11-17 > 16:52:07.246852461 +0000 > +++ gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c 2017-11-17 > 16:52:07.625528250 +0000 > @@ -46,5 +46,9 @@ int main (void) > return 0; > } > > -/* { dg-final { scan-tree-dump-times "Detected reduction\\." 2 "vect" } } */ > +/* 2 for the first loop. */ > +/* { dg-final { scan-tree-dump-times "Detected reduction\\." 3 "vect" { > target { ! vect_multiple_sizes } } } } */ > +/* { dg-final { scan-tree-dump "Detected reduction\\." "vect" { target > vect_multiple_sizes } } } */ > +/* { dg-final { scan-tree-dump-times "not vectorized" 1 "vect" { target { ! > vect_multiple_sizes } } } } */ > +/* { dg-final { scan-tree-dump "not vectorized" "vect" { target > vect_multiple_sizes } } } */ > /* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target > { ! 
vect_no_int_min_max } } } } */ > Index: gcc/testsuite/gcc.dg/vect/vect-reduc-6.c > =================================================================== > --- gcc/testsuite/gcc.dg/vect/vect-reduc-6.c 2017-11-17 16:52:07.246852461 > +0000 > +++ gcc/testsuite/gcc.dg/vect/vect-reduc-6.c 2017-11-17 16:52:07.625528250 > +0000 > @@ -50,4 +50,5 @@ int main (void) > > /* need -ffast-math to vectorizer these loops. */ > /* ARM NEON passes -ffast-math to these tests, so expect this to fail. */ > -/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { xfail > arm_neon_ok } } } */ > +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target > { ! vect_fold_left_plus } xfail arm_neon_ok } } } */ > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target > vect_fold_left_plus } } } */ > Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1.c > =================================================================== > --- /dev/null 2017-11-14 14:28:07.424493901 +0000 > +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1.c 2017-11-17 > 16:52:07.625528250 +0000 > @@ -0,0 +1,28 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ > + > +#define NUM_ELEMS(TYPE) ((int)(5 * (256 / sizeof (TYPE)) + 3)) > + > +#define DEF_REDUC_PLUS(TYPE) \ > + TYPE __attribute__ ((noinline, noclone)) \ > + reduc_plus_##TYPE (TYPE *a, TYPE *b) \ > + { \ > + TYPE r = 0, q = 3; \ > + for (int i = 0; i < NUM_ELEMS(TYPE); i++) \ > + { \ > + r += a[i]; \ > + q -= b[i]; \ > + } \ > + return r * q; \ > + } > + > +#define TEST_ALL(T) \ > + T (_Float16) \ > + T (float) \ > + T (double) > + > +TEST_ALL (DEF_REDUC_PLUS) > + > +/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, > z[0-9]+\.h} 2 } } */ > +/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, > z[0-9]+\.s} 2 } } */ > +/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, > z[0-9]+\.d} 2 } } */ > Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1_run.c > =================================================================== > --- /dev/null 2017-11-14 14:28:07.424493901 +0000 > +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1_run.c 2017-11-17 > 16:52:07.625528250 +0000 > @@ -0,0 +1,29 @@ > +/* { dg-do run { target { aarch64_sve_hw } } } */ > +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ > + > +#include "sve_reduc_strict_1.c" > + > +#define TEST_REDUC_PLUS(TYPE) \ > + { \ > + TYPE a[NUM_ELEMS (TYPE)]; \ > + TYPE b[NUM_ELEMS (TYPE)]; \ > + TYPE r = 0, q = 3; \ > + for (int i = 0; i < NUM_ELEMS (TYPE); i++) \ > + { \ > + a[i] = (i * 0.1) * (i & 1 ? 1 : -1); \ > + b[i] = (i * 0.3) * (i & 1 ? 
1 : -1); \ > + r += a[i]; \ > + q -= b[i]; \ > + asm volatile ("" ::: "memory"); \ > + } \ > + TYPE res = reduc_plus_##TYPE (a, b); \ > + if (res != r * q) \ > + __builtin_abort (); \ > + } > + > +int __attribute__ ((optimize (1))) > +main () > +{ > + TEST_ALL (TEST_REDUC_PLUS); > + return 0; > +} > Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2.c > =================================================================== > --- /dev/null 2017-11-14 14:28:07.424493901 +0000 > +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2.c 2017-11-17 > 16:52:07.625528250 +0000 > @@ -0,0 +1,28 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */ > + > +#define NUM_ELEMS(TYPE) ((int) (5 * (256 / sizeof (TYPE)) + 3)) > + > +#define DEF_REDUC_PLUS(TYPE) \ > +void __attribute__ ((noinline, noclone)) \ > +reduc_plus_##TYPE (TYPE (*restrict a)[NUM_ELEMS(TYPE)], \ > + TYPE *restrict r, int n) \ > +{ \ > + for (int i = 0; i < n; i++) \ > + { \ > + r[i] = 0; \ > + for (int j = 0; j < NUM_ELEMS(TYPE); j++) \ > + r[i] += a[i][j]; \ > + } \ > +} > + > +#define TEST_ALL(T) \ > + T (_Float16) \ > + T (float) \ > + T (double) > + > +TEST_ALL (DEF_REDUC_PLUS) > + > +/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, > z[0-9]+\.h} 1 } } */ > +/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, > z[0-9]+\.s} 1 } } */ > +/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, > z[0-9]+\.d} 1 } } */ > Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2_run.c > =================================================================== > --- /dev/null 2017-11-14 14:28:07.424493901 +0000 > +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2_run.c 2017-11-17 > 16:52:07.626442926 +0000 > @@ -0,0 +1,31 @@ > +/* { dg-do run { target { aarch64_sve_hw } } } */ > +/* { dg-options "-O2 -ftree-vectorize -fno-inline -march=armv8-a+sve" } */ > + > +#include "sve_reduc_strict_2.c" > + > +#define NROWS 5 > + > +#define TEST_REDUC_PLUS(TYPE) \ > + { \ > + TYPE a[NROWS][NUM_ELEMS (TYPE)]; \ > + TYPE r[NROWS]; \ > + TYPE expected[NROWS] = {}; \ > + for (int i = 0; i < NROWS; ++i) \ > + for (int j = 0; j < NUM_ELEMS (TYPE); ++j) \ > + { \ > + a[i][j] = (i * 0.1 + j * 0.6) * (j & 1 ? 
1 : -1); \ > + expected[i] += a[i][j]; \ > + asm volatile ("" ::: "memory"); \ > + } \ > + reduc_plus_##TYPE (a, r, NROWS); \ > + for (int i = 0; i < NROWS; ++i) \ > + if (r[i] != expected[i]) \ > + __builtin_abort (); \ > + } > + > +int __attribute__ ((optimize (1))) > +main () > +{ > + TEST_ALL (TEST_REDUC_PLUS); > + return 0; > +} > Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_3.c > =================================================================== > --- /dev/null 2017-11-14 14:28:07.424493901 +0000 > +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_3.c 2017-11-17 > 16:52:07.626442926 +0000 > @@ -0,0 +1,131 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -ftree-vectorize -fno-inline -march=armv8-a+sve > -msve-vector-bits=256 -fdump-tree-vect-details" } */ > + > +double mat[100][4]; > +double mat2[100][8]; > +double mat3[100][12]; > +double mat4[100][3]; > + > +double > +slp_reduc_plus (int n) > +{ > + double tmp = 0.0; > + for (int i = 0; i < n; i++) > + { > + tmp = tmp + mat[i][0]; > + tmp = tmp + mat[i][1]; > + tmp = tmp + mat[i][2]; > + tmp = tmp + mat[i][3]; > + } > + return tmp; > +} > + > +double > +slp_reduc_plus2 (int n) > +{ > + double tmp = 0.0; > + for (int i = 0; i < n; i++) > + { > + tmp = tmp + mat2[i][0]; > + tmp = tmp + mat2[i][1]; > + tmp = tmp + mat2[i][2]; > + tmp = tmp + mat2[i][3]; > + tmp = tmp + mat2[i][4]; > + tmp = tmp + mat2[i][5]; > + tmp = tmp + mat2[i][6]; > + tmp = tmp + mat2[i][7]; > + } > + return tmp; > +} > + > +double > +slp_reduc_plus3 (int n) > +{ > + double tmp = 0.0; > + for (int i = 0; i < n; i++) > + { > + tmp = tmp + mat3[i][0]; > + tmp = tmp + mat3[i][1]; > + tmp = tmp + mat3[i][2]; > + tmp = tmp + mat3[i][3]; > + tmp = tmp + mat3[i][4]; > + tmp = tmp + mat3[i][5]; > + tmp = tmp + mat3[i][6]; > + tmp = tmp + mat3[i][7]; > + tmp = tmp + mat3[i][8]; > + tmp = tmp + mat3[i][9]; > + tmp = tmp + mat3[i][10]; > + tmp = tmp + mat3[i][11]; > + } > + return tmp; > +} > + > +void > +slp_non_chained_reduc (int n, double * restrict out) > +{ > + for (int i = 0; i < 3; i++) > + out[i] = 0; > + > + for (int i = 0; i < n; i++) > + { > + out[0] = out[0] + mat4[i][0]; > + out[1] = out[1] + mat4[i][1]; > + out[2] = out[2] + mat4[i][2]; > + } > +} > + > +/* Strict FP reductions shouldn't be used for the outer loops, only the > + inner loops. */ > + > +float > +double_reduc1 (float (*restrict i)[16]) > +{ > + float l = 0; > + > + for (int a = 0; a < 8; a++) > + for (int b = 0; b < 8; b++) > + l += i[b][a]; > + return l; > +} > + > +float > +double_reduc2 (float *restrict i) > +{ > + float l = 0; > + > + for (int a = 0; a < 8; a++) > + for (int b = 0; b < 16; b++) > + { > + l += i[b * 4]; > + l += i[b * 4 + 1]; > + l += i[b * 4 + 2]; > + l += i[b * 4 + 3]; > + } > + return l; > +} > + > +float > +double_reduc3 (float *restrict i, float *restrict j) > +{ > + float k = 0, l = 0; > + > + for (int a = 0; a < 8; a++) > + for (int b = 0; b < 8; b++) > + { > + k += i[b]; > + l += j[b]; > + } > + return l * k; > +} > + > +/* We can't yet handle double_reduc1. */ > +/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, > z[0-9]+\.s} 3 } } */ > +/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, > z[0-9]+\.d} 9 } } */ > +/* 1 reduction each for double_reduc{1,2} and 2 for double_reduc3. Each one > + is reported three times, once for SVE, once for 128-bit AdvSIMD and once > + for 64-bit AdvSIMD. 
*/ > +/* { dg-final { scan-tree-dump-times "Detected double reduction" 12 "vect" } > } */ > +/* double_reduc2 has 2 reductions and slp_non_chained_reduc has 3. > + double_reduc1 is reported 3 times (SVE, 128-bit AdvSIMD, 64-bit AdvSIMD) > + before failing. */ > +/* { dg-final { scan-tree-dump-times "Detected reduction" 12 "vect" } } */ > Index: gcc/testsuite/gcc.target/aarch64/sve_slp_13.c > =================================================================== > --- gcc/testsuite/gcc.target/aarch64/sve_slp_13.c 2017-11-17 > 16:52:07.246852461 +0000 > +++ gcc/testsuite/gcc.target/aarch64/sve_slp_13.c 2017-11-17 > 16:52:07.626442926 +0000 > @@ -1,5 +1,6 @@ > /* { dg-do compile } */ > -/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve > -msve-vector-bits=scalable" } */ > +/* The cost model thinks that the double loop isn't a win for SVE-128. */ > +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve > -msve-vector-bits=scalable -fno-vect-cost-model" } */ > > #include <stdint.h> > > @@ -24,7 +25,10 @@ #define TEST_ALL(T) \ > T (int32_t) \ > T (uint32_t) \ > T (int64_t) \ > - T (uint64_t) > + T (uint64_t) \ > + T (_Float16) \ > + T (float) \ > + T (double) > > TEST_ALL (VEC_PERM) > > @@ -32,21 +36,25 @@ TEST_ALL (VEC_PERM) > /* ??? We don't treat the uint loops as SLP. */ > /* The loop should be fully-masked. */ > /* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail *-*-* } } } */ > -/* { dg-final { scan-assembler-times {\tld1h\t} 2 { xfail *-*-* } } } */ > -/* { dg-final { scan-assembler-times {\tld1w\t} 2 { xfail *-*-* } } } */ > -/* { dg-final { scan-assembler-times {\tld1w\t} 1 } } */ > -/* { dg-final { scan-assembler-times {\tld1d\t} 2 { xfail *-*-* } } } */ > -/* { dg-final { scan-assembler-times {\tld1d\t} 1 } } */ > +/* { dg-final { scan-assembler-times {\tld1h\t} 3 { xfail *-*-* } } } */ > +/* { dg-final { scan-assembler-times {\tld1w\t} 3 { xfail *-*-* } } } */ > +/* { dg-final { scan-assembler-times {\tld1w\t} 2 } } */ > +/* { dg-final { scan-assembler-times {\tld1d\t} 3 { xfail *-*-* } } } */ > +/* { dg-final { scan-assembler-times {\tld1d\t} 2 } } */ > /* { dg-final { scan-assembler-not {\tldr} { xfail *-*-* } } } */ > > /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 { xfail *-*-* > } } } */ > -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 4 { xfail *-*-* > } } } */ > -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 4 } } */ > -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 4 } } */ > +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 { xfail *-*-* > } } } */ > +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */ > +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */ > > /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], > z[0-9]+\.b\n} 2 { xfail *-*-* } } } */ > /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], > z[0-9]+\.h\n} 2 { xfail *-*-* } } } */ > /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], > z[0-9]+\.s\n} 2 } } */ > /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], > z[0-9]+\.d\n} 2 } } */ > +/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, > z[0-9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, > z[0-9]+\.s\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, > z[0-9]+\.d\n} 1 } } */ > +/* { dg-final { scan-assembler-not {\tfadd\n} } } */ > > /* { dg-final { scan-assembler-not {\tuqdec} } } */ > 
Index: gcc/testsuite/gfortran.dg/vect/vect-8.f90 > =================================================================== > --- gcc/testsuite/gfortran.dg/vect/vect-8.f90 2017-11-17 16:52:07.246852461 > +0000 > +++ gcc/testsuite/gfortran.dg/vect/vect-8.f90 2017-11-17 16:52:07.626442926 > +0000 > @@ -704,5 +704,6 @@ CALL track('KERNEL ') > RETURN > END SUBROUTINE kernel > > -! { dg-final { scan-tree-dump-times "vectorized 21 loops" 1 "vect" { target > { vect_intdouble_cvt } } } } > ! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { target > { ! vect_intdouble_cvt } } } } > +! { dg-final { scan-tree-dump-times "vectorized 21 loops" 1 "vect" { target > { vect_intdouble_cvt && { ! vect_fold_left_plus } } } } } > +! { dg-final { scan-tree-dump-times "vectorized 25 loops" 1 "vect" { target > { vect_intdouble_cvt && vect_fold_left_plus } } } }
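As a final note on the intended semantics: fold_left_plus_<m>, as
documented in md.texi above, behaves like the scalar loop below
(a sketch only; op0, op1 and op2 stand for operands 0 to 2 and NUNITS
for the number of elements of mode m):

  op0 = op1;                        /* start from the scalar operand */
  for (int i = 0; i < NUNITS; ++i)
    op0 += op2[i];                  /* strictly in order, no reassociation */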