On Fri, Nov 17, 2017 at 5:53 PM, Richard Sandiford
<richard.sandif...@linaro.org> wrote:
> This patch adds support for in-order floating-point addition reductions,
> which are suitable even in strict IEEE mode.
>
> Previously vect_is_simple_reduction would reject any cases that forbid
> reassociation.  The idea is instead to tentatively accept them as
> "FOLD_LEFT_REDUCTIONs" and only fail later if there is no target
> support for them.  Although this patch only handles the particular
> case of plus and minus on floating-point types, there's no reason in
> principle why targets couldn't handle other cases.
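>
> For example, the simplest case this enables is a plain accumulation
> compiled without -ffast-math/-fassociative-math, where the additions
> must stay in their original order (illustrative example, not part of
> the testsuite changes below):
>
>   double
>   sum (double *a, int n)
>   {
>     double res = 0.0;
>     for (int i = 0; i < n; ++i)
>       res += a[i];
>     return res;
>   }
>
> With a fold_left_plus pattern (such as SVE FADDA) this loop can be
> vectorized while keeping the scalar accumulator and the original
> evaluation order.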
>
> The vect_force_simple_reduction change makes it simpler for parloops
> to read the type of reduction.
>
> Tested on aarch64-linux-gnu (with and without SVE), x86_64-linux-gnu
> and powerpc64le-linux-gnu.  OK to install?

I don't like that you add a new tree code for this.  A new internal
function (IFN) looks more suitable to me.

Also, I think that if there's a way to handle this correctly with target
support, you can also implement a fallback for targets without such
support, which would increase test coverage.  It would basically boil down
to extracting all the scalars from the non-reduction operand vector and
performing a series of scalar reduction ops, keeping the reduction PHI
scalar.  This would also support any reduction operator.
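
As a rough sketch (not real GIMPLE, and assuming VF = 4), the vectorized
body for such a fallback would keep the scalar accumulator and emit
element extracts plus a chain of scalar operations:

    /* vec = vectorized non-reduction operand for this iteration;
       res = result of the scalar reduction PHI.  */
    res = res OP vec[0];
    res = res OP vec[1];
    res = res OP vec[2];
    res = res OP vec[3];

Nothing is reassociated, so the strict IEEE semantics are preserved; it
is just likely to be slower than a dedicated instruction such as SVE's
FADDA.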

Thanks,
Richard.

> Richard
>
>
> 2017-11-17  Richard Sandiford  <richard.sandif...@linaro.org>
>             Alan Hayward  <alan.hayw...@arm.com>
>             David Sherwood  <david.sherw...@arm.com>
>
> gcc/
>         * tree.def (FOLD_LEFT_PLUS_EXPR): New tree code.
>         * doc/generic.texi (FOLD_LEFT_PLUS_EXPR): Document.
>         * optabs.def (fold_left_plus_optab): New optab.
>         * doc/md.texi (fold_left_plus_@var{m}): Document.
>         * doc/sourcebuild.texi (vect_fold_left_plus): Document.
>         * cfgexpand.c (expand_debug_expr): Handle FOLD_LEFT_PLUS_EXPR.
>         * expr.c (expand_expr_real_2): Likewise.
>         * fold-const.c (const_binop): Likewise.
>         * optabs-tree.c (optab_for_tree_code): Likewise.
>         * tree-cfg.c (verify_gimple_assign_binary): Likewise.
>         * tree-inline.c (estimate_operator_cost): Likewise.
>         * tree-pretty-print.c (dump_generic_node): Likewise.
>         (op_code_prio): Likewise.
>         (op_symbol_code): Likewise.
>         * tree-vect-stmts.c (vectorizable_operation): Likewise.
>         * tree-parloops.c (valid_reduction_p): New function.
>         (gather_scalar_reductions): Use it.
>         * tree-vectorizer.h (FOLD_LEFT_REDUCTION): New vect_reduction_type.
>         (vect_finish_replace_stmt): Declare.
>         * tree-vect-loop.c (fold_left_reduction_code): New function.
>         (needs_fold_left_reduction_p): New function, split out from...
>         (vect_is_simple_reduction): ...here.  Accept reductions that
>         forbid reassociation, but give them type FOLD_LEFT_REDUCTION.
>         (vect_force_simple_reduction): Also store the reduction type in
>         the assignment's STMT_VINFO_REDUC_TYPE.
>         (vect_model_reduction_cost): Handle FOLD_LEFT_REDUCTION.
>         (merge_with_identity): New function.
>         (vectorize_fold_left_reduction): Likewise.
>         (vectorizable_reduction): Handle FOLD_LEFT_REDUCTION.  Leave the
>         scalar phi in place for it.  Require target support and reject
>         cases that would reassociate the operation.  Defer the transform
>         phase to vectorize_fold_left_reduction.
>         * config/aarch64/aarch64.md (UNSPEC_FADDA): New unspec.
>         * config/aarch64/aarch64-sve.md (fold_left_plus_<mode>): New expander.
>         (*fold_left_plus_<mode>, *pred_fold_left_plus_<mode>): New insns.
>
> gcc/testsuite/
>         * lib/target-supports.exp (check_effective_target_vect_fold_left_plus):
>         New proc.
>         * gcc.dg/vect/no-fast-math-vect16.c: Expect the test to pass if
>         vect_fold_left_plus.
>         * gcc.dg/vect/pr79920.c: Expect both loops to be vectorized if
>         vect_fold_left_plus.
>         * gcc.dg/vect/trapv-vect-reduc-4.c: Expect the first loop to be
>         recognized as a reduction and then rejected for lack of target
>         support.
>         * gcc.dg/vect/vect-reduc-6.c: Expect the loop to be vectorized if
>         vect_fold_left_plus.
>         * gcc.target/aarch64/sve_reduc_strict_1.c: New test.
>         * gcc.target/aarch64/sve_reduc_strict_1_run.c: Likewise.
>         * gcc.target/aarch64/sve_reduc_strict_2.c: Likewise.
>         * gcc.target/aarch64/sve_reduc_strict_2_run.c: Likewise.
>         * gcc.target/aarch64/sve_reduc_strict_3.c: Likewise.
>         * gcc.target/aarch64/sve_slp_13.c: Add floating-point types.
>         * gfortran.dg/vect/vect-8.f90: Expect 25 loops to be vectorized if
>         vect_fold_left_plus.
>
> Index: gcc/tree.def
> ===================================================================
> --- gcc/tree.def        2017-11-17 16:52:07.246852461 +0000
> +++ gcc/tree.def        2017-11-17 16:52:07.631930981 +0000
> @@ -1302,6 +1302,8 @@ DEFTREECODE (REDUC_AND_EXPR, "reduc_and_
>  DEFTREECODE (REDUC_IOR_EXPR, "reduc_ior_expr", tcc_unary, 1)
>  DEFTREECODE (REDUC_XOR_EXPR, "reduc_xor_expr", tcc_unary, 1)
>
> +DEFTREECODE (FOLD_LEFT_PLUS_EXPR, "fold_left_plus_expr", tcc_binary, 2)
> +
>  /* Widening dot-product.
>     The first two arguments are of type t1.
>     The third argument and the result are of type t2, such that t2 is at least
> Index: gcc/doc/generic.texi
> ===================================================================
> --- gcc/doc/generic.texi        2017-11-17 16:52:07.246852461 +0000
> +++ gcc/doc/generic.texi        2017-11-17 16:52:07.620954871 +0000
> @@ -1746,6 +1746,7 @@ a value from @code{enum annot_expr_kind}
>  @tindex REDUC_AND_EXPR
>  @tindex REDUC_IOR_EXPR
>  @tindex REDUC_XOR_EXPR
> +@tindex FOLD_LEFT_PLUS_EXPR
>
>  @table @code
>  @item VEC_DUPLICATE_EXPR
> @@ -1861,6 +1862,12 @@ the maximum element in @var{x}.  The ass
>  is unspecified; for example, @samp{REDUC_PLUS_EXPR <@var{x}>} could
>  sum floating-point @var{x} in forward order, in reverse order,
>  using a tree, or in some other way.
> +
> +@item FOLD_LEFT_PLUS_EXPR
> +This node takes two arguments: a scalar of type @var{t} and a vector
> +of @var{t}s.  It successively adds each element of the vector to the
> +scalar and returns the result.  The operation is strictly in-order:
> +there is no reassociation.
>  @end table
>
>
> Index: gcc/optabs.def
> ===================================================================
> --- gcc/optabs.def      2017-11-17 16:52:07.246852461 +0000
> +++ gcc/optabs.def      2017-11-17 16:52:07.625528250 +0000
> @@ -306,6 +306,7 @@ OPTAB_D (reduc_umin_scal_optab, "reduc_u
>  OPTAB_D (reduc_and_scal_optab,  "reduc_and_scal_$a")
>  OPTAB_D (reduc_ior_scal_optab,  "reduc_ior_scal_$a")
>  OPTAB_D (reduc_xor_scal_optab,  "reduc_xor_scal_$a")
> +OPTAB_D (fold_left_plus_optab, "fold_left_plus_$a")
>
>  OPTAB_D (extract_last_optab, "extract_last_$a")
>  OPTAB_D (fold_extract_last_optab, "fold_extract_last_$a")
> Index: gcc/doc/md.texi
> ===================================================================
> --- gcc/doc/md.texi     2017-11-17 16:52:07.246852461 +0000
> +++ gcc/doc/md.texi     2017-11-17 16:52:07.621869547 +0000
> @@ -5285,6 +5285,14 @@ has mode @var{m} and operands 0 and 1 ha
>  one element of @var{m}.  Operand 2 has the usual mask mode for vectors
>  of mode @var{m}; see @code{TARGET_VECTORIZE_GET_MASK_MODE}.
>
> +@cindex @code{fold_left_plus_@var{m}} instruction pattern
> +@item @code{fold_left_plus_@var{m}}
> +Take scalar operand 1 and successively add each element from vector
> +operand 2.  Store the result in scalar operand 0.  The vector has
> +mode @var{m} and the scalars have the mode appropriate for one
> +element of @var{m}.  The operation is strictly in-order: there is
> +no reassociation.
> +
>  @cindex @code{sdot_prod@var{m}} instruction pattern
>  @item @samp{sdot_prod@var{m}}
>  @cindex @code{udot_prod@var{m}} instruction pattern
> Index: gcc/doc/sourcebuild.texi
> ===================================================================
> --- gcc/doc/sourcebuild.texi    2017-11-17 16:52:07.246852461 +0000
> +++ gcc/doc/sourcebuild.texi    2017-11-17 16:52:07.621869547 +0000
> @@ -1580,6 +1580,9 @@ Target supports AND, IOR and XOR reducti
>
>  @item vect_fold_extract_last
>  Target supports the @code{fold_extract_last} optab.
> +
> +@item vect_fold_left_plus
> +Target supports the @code{fold_left_plus} optab.
>  @end table
>
>  @subsubsection Thread Local Storage attributes
> Index: gcc/cfgexpand.c
> ===================================================================
> --- gcc/cfgexpand.c     2017-11-17 16:52:07.246852461 +0000
> +++ gcc/cfgexpand.c     2017-11-17 16:52:07.620040195 +0000
> @@ -5072,6 +5072,7 @@ expand_debug_expr (tree exp)
>      case REDUC_AND_EXPR:
>      case REDUC_IOR_EXPR:
>      case REDUC_XOR_EXPR:
> +    case FOLD_LEFT_PLUS_EXPR:
>      case VEC_COND_EXPR:
>      case VEC_PACK_FIX_TRUNC_EXPR:
>      case VEC_PACK_SAT_EXPR:
> Index: gcc/expr.c
> ===================================================================
> --- gcc/expr.c  2017-11-17 16:52:07.246852461 +0000
> +++ gcc/expr.c  2017-11-17 16:52:07.622784222 +0000
> @@ -9438,6 +9438,28 @@ #define REDUCE_BIT_FIELD(expr)   (reduce_b
>          return target;
>        }
>
> +    case FOLD_LEFT_PLUS_EXPR:
> +      {
> +       op0 = expand_normal (treeop0);
> +       op1 = expand_normal (treeop1);
> +       this_optab = optab_for_tree_code (code, type, optab_default);
> +       machine_mode vec_mode = TYPE_MODE (TREE_TYPE (treeop1));
> +       insn_code icode = optab_handler (this_optab, vec_mode);
> +
> +       if (icode != CODE_FOR_nothing)
> +         {
> +           struct expand_operand ops[3];
> +           create_output_operand (&ops[0], target, mode);
> +           create_input_operand (&ops[1], op0, mode);
> +           create_input_operand (&ops[2], op1, vec_mode);
> +           if (maybe_expand_insn (icode, 3, ops))
> +             return ops[0].value;
> +         }
> +
> +       /* Nothing to fall back to.  */
> +       gcc_unreachable ();
> +      }
> +
>      case REDUC_MAX_EXPR:
>      case REDUC_MIN_EXPR:
>      case REDUC_PLUS_EXPR:
> Index: gcc/fold-const.c
> ===================================================================
> --- gcc/fold-const.c    2017-11-17 16:52:07.246852461 +0000
> +++ gcc/fold-const.c    2017-11-17 16:52:07.623698898 +0000
> @@ -1603,6 +1603,32 @@ const_binop (enum tree_code code, tree a
>         return NULL_TREE;
>        return build_vector_from_val (TREE_TYPE (arg1), sub);
>      }
> +
> +  if (CONSTANT_CLASS_P (arg1)
> +      && TREE_CODE (arg2) == VECTOR_CST)
> +    {
> +      tree_code subcode;
> +
> +      switch (code)
> +       {
> +       case FOLD_LEFT_PLUS_EXPR:
> +         subcode = PLUS_EXPR;
> +         break;
> +       default:
> +         return NULL_TREE;
> +       }
> +
> +      int nelts = VECTOR_CST_NELTS (arg2);
> +      tree accum = arg1;
> +      for (int i = 0; i < nelts; i++)
> +       {
> +         accum = const_binop (subcode, accum, VECTOR_CST_ELT (arg2, i));
> +         if (accum == NULL_TREE || !CONSTANT_CLASS_P (accum))
> +           return NULL_TREE;
> +       }
> +
> +      return accum;
> +    }
>    return NULL_TREE;
>  }
>
> Index: gcc/optabs-tree.c
> ===================================================================
> --- gcc/optabs-tree.c   2017-11-17 16:52:07.246852461 +0000
> +++ gcc/optabs-tree.c   2017-11-17 16:52:07.623698898 +0000
> @@ -166,6 +166,9 @@ optab_for_tree_code (enum tree_code code
>      case REDUC_XOR_EXPR:
>        return reduc_xor_scal_optab;
>
> +    case FOLD_LEFT_PLUS_EXPR:
> +      return fold_left_plus_optab;
> +
>      case VEC_WIDEN_MULT_HI_EXPR:
>        return TYPE_UNSIGNED (type) ?
>         vec_widen_umult_hi_optab : vec_widen_smult_hi_optab;
> Index: gcc/tree-cfg.c
> ===================================================================
> --- gcc/tree-cfg.c      2017-11-17 16:52:07.246852461 +0000
> +++ gcc/tree-cfg.c      2017-11-17 16:52:07.628272277 +0000
> @@ -4116,6 +4116,19 @@ verify_gimple_assign_binary (gassign *st
>        /* Continue with generic binary expression handling.  */
>        break;
>
> +    case FOLD_LEFT_PLUS_EXPR:
> +      if (!VECTOR_TYPE_P (rhs2_type)
> +         || !useless_type_conversion_p (lhs_type, TREE_TYPE (rhs2_type))
> +         || !useless_type_conversion_p (lhs_type, rhs1_type))
> +       {
> +         error ("reduction should convert from vector to element type");
> +         debug_generic_expr (lhs_type);
> +         debug_generic_expr (rhs1_type);
> +         debug_generic_expr (rhs2_type);
> +         return true;
> +       }
> +      return false;
> +
>      case VEC_SERIES_EXPR:
>        if (!useless_type_conversion_p (rhs1_type, rhs2_type))
>         {
> Index: gcc/tree-inline.c
> ===================================================================
> --- gcc/tree-inline.c   2017-11-17 16:52:07.246852461 +0000
> +++ gcc/tree-inline.c   2017-11-17 16:52:07.628272277 +0000
> @@ -3881,6 +3881,7 @@ estimate_operator_cost (enum tree_code c
>      case REDUC_AND_EXPR:
>      case REDUC_IOR_EXPR:
>      case REDUC_XOR_EXPR:
> +    case FOLD_LEFT_PLUS_EXPR:
>      case WIDEN_SUM_EXPR:
>      case WIDEN_MULT_EXPR:
>      case DOT_PROD_EXPR:
> Index: gcc/tree-pretty-print.c
> ===================================================================
> --- gcc/tree-pretty-print.c     2017-11-17 16:52:07.246852461 +0000
> +++ gcc/tree-pretty-print.c     2017-11-17 16:52:07.629186953 +0000
> @@ -3232,6 +3232,7 @@ dump_generic_node (pretty_printer *pp, t
>        break;
>
>      case VEC_SERIES_EXPR:
> +    case FOLD_LEFT_PLUS_EXPR:
>      case VEC_WIDEN_MULT_HI_EXPR:
>      case VEC_WIDEN_MULT_LO_EXPR:
>      case VEC_WIDEN_MULT_EVEN_EXPR:
> @@ -3628,6 +3629,7 @@ op_code_prio (enum tree_code code)
>      case REDUC_MAX_EXPR:
>      case REDUC_MIN_EXPR:
>      case REDUC_PLUS_EXPR:
> +    case FOLD_LEFT_PLUS_EXPR:
>      case VEC_UNPACK_HI_EXPR:
>      case VEC_UNPACK_LO_EXPR:
>      case VEC_UNPACK_FLOAT_HI_EXPR:
> @@ -3749,6 +3751,9 @@ op_symbol_code (enum tree_code code)
>      case REDUC_PLUS_EXPR:
>        return "r+";
>
> +    case FOLD_LEFT_PLUS_EXPR:
> +      return "fl+";
> +
>      case WIDEN_SUM_EXPR:
>        return "w+";
>
> Index: gcc/tree-vect-stmts.c
> ===================================================================
> --- gcc/tree-vect-stmts.c       2017-11-17 16:52:07.246852461 +0000
> +++ gcc/tree-vect-stmts.c       2017-11-17 16:52:07.631016305 +0000
> @@ -5415,6 +5415,10 @@ vectorizable_operation (gimple *stmt, gi
>
>    code = gimple_assign_rhs_code (stmt);
>
> +  /* Ignore operations that mix scalar and vector input operands.  */
> +  if (code == FOLD_LEFT_PLUS_EXPR)
> +    return false;
> +
>    /* For pointer addition, we should use the normal plus for
>       the vector addition.  */
>    if (code == POINTER_PLUS_EXPR)
> Index: gcc/tree-parloops.c
> ===================================================================
> --- gcc/tree-parloops.c 2017-11-17 16:52:07.246852461 +0000
> +++ gcc/tree-parloops.c 2017-11-17 16:52:07.629186953 +0000
> @@ -2531,6 +2531,19 @@ set_reduc_phi_uids (reduction_info **slo
>    return 1;
>  }
>
> +/* Return true if the type of reduction performed by STMT is suitable
> +   for this pass.  */
> +
> +static bool
> +valid_reduction_p (gimple *stmt)
> +{
> +  /* Parallelization would reassociate the operation, which isn't
> +     allowed for in-order reductions.  */
> +  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> +  vect_reduction_type reduc_type = STMT_VINFO_REDUC_TYPE (stmt_info);
> +  return reduc_type != FOLD_LEFT_REDUCTION;
> +}
> +
>  /* Detect all reductions in the LOOP, insert them into REDUCTION_LIST.  */
>
>  static void
> @@ -2564,7 +2577,7 @@ gather_scalar_reductions (loop_p loop, r
>        gimple *reduc_stmt
>         = vect_force_simple_reduction (simple_loop_info, phi,
>                                        &double_reduc, true);
> -      if (!reduc_stmt)
> +      if (!reduc_stmt || !valid_reduction_p (reduc_stmt))
>         continue;
>
>        if (double_reduc)
> @@ -2610,7 +2623,8 @@ gather_scalar_reductions (loop_p loop, r
>                 = vect_force_simple_reduction (simple_loop_info, inner_phi,
>                                                &double_reduc, true);
>               gcc_assert (!double_reduc);
> -             if (inner_reduc_stmt == NULL)
> +             if (inner_reduc_stmt == NULL
> +                 || !valid_reduction_p (inner_reduc_stmt))
>                 continue;
>
>               build_new_reduction (reduction_list, double_reduc_stmts[i], phi);
> Index: gcc/tree-vectorizer.h
> ===================================================================
> --- gcc/tree-vectorizer.h       2017-11-17 16:52:07.246852461 +0000
> +++ gcc/tree-vectorizer.h       2017-11-17 16:52:07.631016305 +0000
> @@ -74,7 +74,15 @@ enum vect_reduction_type {
>
>         for (int i = 0; i < VF; ++i)
>           res = cond[i] ? val[i] : res;  */
> -  EXTRACT_LAST_REDUCTION
> +  EXTRACT_LAST_REDUCTION,
> +
> +  /* Use a folding reduction within the loop to implement:
> +
> +       for (int i = 0; i < VF; ++i)
> +         res = res OP val[i];
> +
> +     (with no reassociation).  */
> +  FOLD_LEFT_REDUCTION
>  };
>
>  #define VECTORIZABLE_CYCLE_DEF(D) (((D) == vect_reduction_def)           \
> @@ -1389,6 +1397,7 @@ extern void vect_model_load_cost (stmt_v
>  extern unsigned record_stmt_cost (stmt_vector_for_cost *, int,
>                                   enum vect_cost_for_stmt, stmt_vec_info,
>                                   int, enum vect_cost_model_location);
> +extern void vect_finish_replace_stmt (gimple *, gimple *);
>  extern void vect_finish_stmt_generation (gimple *, gimple *,
>                                           gimple_stmt_iterator *);
>  extern bool vect_mark_stmts_to_be_vectorized (loop_vec_info);
> Index: gcc/tree-vect-loop.c
> ===================================================================
> --- gcc/tree-vect-loop.c        2017-11-17 16:52:07.246852461 +0000
> +++ gcc/tree-vect-loop.c        2017-11-17 16:52:07.630101629 +0000
> @@ -2573,6 +2573,29 @@ vect_analyze_loop (struct loop *loop, lo
>      }
>  }
>
> +/* Return true if the target supports in-order reductions for operation
> +   CODE and type TYPE.  If the target supports it, store the reduction
> +   operation in *REDUC_CODE.  */
> +
> +static bool
> +fold_left_reduction_code (tree_code code, tree type, tree_code *reduc_code)
> +{
> +  switch (code)
> +    {
> +    case PLUS_EXPR:
> +      code = FOLD_LEFT_PLUS_EXPR;
> +      break;
> +
> +    default:
> +      return false;
> +    }
> +
> +  if (!target_supports_op_p (type, code, optab_vector))
> +    return false;
> +
> +  *reduc_code = code;
> +  return true;
> +}
>
>  /* Function reduction_code_for_scalar_code
>
> @@ -2880,6 +2903,42 @@ vect_is_slp_reduction (loop_vec_info loo
>    return true;
>  }
>
> +/* Returns true if we need an in-order reduction for operation CODE
> +   on type TYPE.  NEED_WRAPPING_INTEGRAL_OVERFLOW is true if integer
> +   overflow must wrap.  */
> +
> +static bool
> +needs_fold_left_reduction_p (tree type, tree_code code,
> +                            bool need_wrapping_integral_overflow)
> +{
> +  /* CHECKME: check for !flag_finite_math_only too?  */
> +  if (SCALAR_FLOAT_TYPE_P (type))
> +    switch (code)
> +      {
> +      case MIN_EXPR:
> +      case MAX_EXPR:
> +       return false;
> +
> +      default:
> +       return !flag_associative_math;
> +      }
> +
> +  if (INTEGRAL_TYPE_P (type))
> +    {
> +      if (!operation_no_trapping_overflow (type, code))
> +       return true;
> +      if (need_wrapping_integral_overflow
> +         && !TYPE_OVERFLOW_WRAPS (type)
> +         && operation_can_overflow (code))
> +       return true;
> +      return false;
> +    }
> +
> +  if (SAT_FIXED_POINT_TYPE_P (type))
> +    return true;
> +
> +  return false;
> +}
>
>  /* Function vect_is_simple_reduction
>
> @@ -3198,58 +3257,18 @@ vect_is_simple_reduction (loop_vec_info
>        return NULL;
>      }
>
> -  /* Check that it's ok to change the order of the computation.
> +  /* Check whether it's ok to change the order of the computation.
>       Generally, when vectorizing a reduction we change the order of the
>       computation.  This may change the behavior of the program in some
>       cases, so we need to check that this is ok.  One exception is when
>       vectorizing an outer-loop: the inner-loop is executed sequentially,
>       and therefore vectorizing reductions in the inner-loop during
>       outer-loop vectorization is safe.  */
> -
> -  if (*v_reduc_type != COND_REDUCTION
> -      && check_reduction)
> -    {
> -      /* CHECKME: check for !flag_finite_math_only too?  */
> -      if (SCALAR_FLOAT_TYPE_P (type) && !flag_associative_math)
> -       {
> -         /* Changing the order of operations changes the semantics.  */
> -         if (dump_enabled_p ())
> -           report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> -                       "reduction: unsafe fp math optimization: ");
> -         return NULL;
> -       }
> -      else if (INTEGRAL_TYPE_P (type))
> -       {
> -         if (!operation_no_trapping_overflow (type, code))
> -           {
> -             /* Changing the order of operations changes the semantics.  */
> -             if (dump_enabled_p ())
> -               report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> -                               "reduction: unsafe int math optimization"
> -                               " (overflow traps): ");
> -             return NULL;
> -           }
> -         if (need_wrapping_integral_overflow
> -             && !TYPE_OVERFLOW_WRAPS (type)
> -             && operation_can_overflow (code))
> -           {
> -             /* Changing the order of operations changes the semantics.  */
> -             if (dump_enabled_p ())
> -               report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> -                               "reduction: unsafe int math optimization"
> -                               " (overflow doesn't wrap): ");
> -             return NULL;
> -           }
> -       }
> -      else if (SAT_FIXED_POINT_TYPE_P (type))
> -       {
> -         /* Changing the order of operations changes the semantics.  */
> -         if (dump_enabled_p ())
> -         report_vect_op (MSG_MISSED_OPTIMIZATION, def_stmt,
> -                         "reduction: unsafe fixed-point math optimization: ");
> -         return NULL;
> -       }
> -    }
> +  if (check_reduction
> +      && *v_reduc_type == TREE_CODE_REDUCTION
> +      && needs_fold_left_reduction_p (type, code,
> +                                     need_wrapping_integral_overflow))
> +    *v_reduc_type = FOLD_LEFT_REDUCTION;
>
>    /* Reduction is safe. We're dealing with one of the following:
>       1) integer arithmetic and no trapv
> @@ -3513,6 +3532,7 @@ vect_force_simple_reduction (loop_vec_in
>        STMT_VINFO_REDUC_TYPE (reduc_def_info) = v_reduc_type;
>        STMT_VINFO_REDUC_DEF (reduc_def_info) = def;
>        reduc_def_info = vinfo_for_stmt (def);
> +      STMT_VINFO_REDUC_TYPE (reduc_def_info) = v_reduc_type;
>        STMT_VINFO_REDUC_DEF (reduc_def_info) = phi;
>      }
>    return def;
> @@ -4065,7 +4085,8 @@ vect_model_reduction_cost (stmt_vec_info
>
>    code = gimple_assign_rhs_code (orig_stmt);
>
> -  if (reduction_type == EXTRACT_LAST_REDUCTION)
> +  if (reduction_type == EXTRACT_LAST_REDUCTION
> +      || reduction_type == FOLD_LEFT_REDUCTION)
>      {
>        /* No extra instructions needed in the prologue.  */
>        prologue_cost = 0;
> @@ -4138,7 +4159,8 @@ vect_model_reduction_cost (stmt_vec_info
>                                           scalar_stmt, stmt_info, 0,
>                                           vect_epilogue);
>         }
> -      else if (reduction_type == EXTRACT_LAST_REDUCTION)
> +      else if (reduction_type == EXTRACT_LAST_REDUCTION
> +              || reduction_type == FOLD_LEFT_REDUCTION)
>         /* No extra instructions need in the epilogue.  */
>         ;
>        else
> @@ -5884,6 +5906,155 @@ vect_create_epilog_for_reduction (vec<tr
>      }
>  }
>
> +/* Return a vector of type VECTYPE that is equal to the vector select
> +   operation "MASK ? VEC : IDENTITY".  Insert the select statements
> +   before GSI.  */
> +
> +static tree
> +merge_with_identity (gimple_stmt_iterator *gsi, tree mask, tree vectype,
> +                    tree vec, tree identity)
> +{
> +  tree cond = make_temp_ssa_name (vectype, NULL, "cond");
> +  gimple *new_stmt = gimple_build_assign (cond, VEC_COND_EXPR,
> +                                         mask, vec, identity);
> +  gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
> +  return cond;
> +}
> +
> +/* Perform an in-order reduction (FOLD_LEFT_REDUCTION).  STMT is the
> +   statement that sets the live-out value.  REDUC_DEF_STMT is the phi
> +   statement.  CODE is the operation performed by STMT and OPS are
> +   its scalar operands.  REDUC_INDEX is the index of the operand in
> +   OPS that is set by REDUC_DEF_STMT.  REDUC_CODE is the code that
> +   implements in-order reduction and VECTYPE_IN is the type of its
> +   vector input.  MASKS specifies the masks that should be used to
> +   control the operation in a fully-masked loop.  */
> +
> +static bool
> +vectorize_fold_left_reduction (gimple *stmt, gimple_stmt_iterator *gsi,
> +                              gimple **vec_stmt, slp_tree slp_node,
> +                              gimple *reduc_def_stmt,
> +                              tree_code code, tree_code reduc_code,
> +                              tree ops[3], tree vectype_in,
> +                              int reduc_index, vec_loop_masks *masks)
> +{
> +  stmt_vec_info stmt_info = vinfo_for_stmt (stmt);
> +  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_info);
> +  struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
> +  tree vectype_out = STMT_VINFO_VECTYPE (stmt_info);
> +  gimple *new_stmt = NULL;
> +
> +  int ncopies;
> +  if (slp_node)
> +    ncopies = 1;
> +  else
> +    ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
> +
> +  gcc_assert (!nested_in_vect_loop_p (loop, stmt));
> +  gcc_assert (ncopies == 1);
> +  gcc_assert (TREE_CODE_LENGTH (code) == binary_op);
> +  gcc_assert (reduc_index == (code == MINUS_EXPR ? 0 : 1));
> +  gcc_assert (STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info)
> +             == FOLD_LEFT_REDUCTION);
> +
> +  if (slp_node)
> +    gcc_assert (must_eq (TYPE_VECTOR_SUBPARTS (vectype_out),
> +                        TYPE_VECTOR_SUBPARTS (vectype_in)));
> +
> +  tree op0 = ops[1 - reduc_index];
> +
> +  int group_size = 1;
> +  gimple *scalar_dest_def;
> +  auto_vec<tree> vec_oprnds0;
> +  if (slp_node)
> +    {
> +      vect_get_vec_defs (op0, NULL_TREE, stmt, &vec_oprnds0, NULL, slp_node);
> +      group_size = SLP_TREE_SCALAR_STMTS (slp_node).length ();
> +      scalar_dest_def = SLP_TREE_SCALAR_STMTS (slp_node)[group_size - 1];
> +    }
> +  else
> +    {
> +      tree loop_vec_def0 = vect_get_vec_def_for_operand (op0, stmt);
> +      vec_oprnds0.create (1);
> +      vec_oprnds0.quick_push (loop_vec_def0);
> +      scalar_dest_def = stmt;
> +    }
> +
> +  tree scalar_dest = gimple_assign_lhs (scalar_dest_def);
> +  tree scalar_type = TREE_TYPE (scalar_dest);
> +  tree reduc_var = gimple_phi_result (reduc_def_stmt);
> +
> +  int vec_num = vec_oprnds0.length ();
> +  gcc_assert (vec_num == 1 || slp_node);
> +  tree vec_elem_type = TREE_TYPE (vectype_out);
> +  gcc_checking_assert (useless_type_conversion_p (scalar_type, vec_elem_type));
> +
> +  tree vector_identity = NULL_TREE;
> +  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +    vector_identity = build_zero_cst (vectype_out);
> +
> +  int i;
> +  tree def0;
> +  FOR_EACH_VEC_ELT (vec_oprnds0, i, def0)
> +    {
> +      tree mask = NULL_TREE;
> +      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
> +       mask = vect_get_loop_mask (gsi, masks, vec_num, vectype_in, i);
> +
> +      /* Handle MINUS by adding the negative.  */
> +      if (code == MINUS_EXPR)
> +       {
> +         tree negated = make_ssa_name (vectype_out);
> +         new_stmt = gimple_build_assign (negated, NEGATE_EXPR, def0);
> +         gsi_insert_before (gsi, new_stmt, GSI_SAME_STMT);
> +         def0 = negated;
> +       }
> +
> +      if (mask)
> +       def0 = merge_with_identity (gsi, mask, vectype_out, def0,
> +                                   vector_identity);
> +
> +      /* On the first iteration the input is simply the scalar phi
> +        result, and for subsequent iterations it is the output of
> +        the preceding operation.  */
> +      tree expr = build2 (reduc_code, scalar_type, reduc_var, def0);
> +
> +      /* For chained SLP reductions the output of the previous reduction
> +        operation serves as the input of the next. For the final statement
> +        the output cannot be a temporary - we reuse the original
> +        scalar destination of the last statement.  */
> +      if (i == vec_num - 1)
> +       reduc_var = scalar_dest;
> +      else
> +       reduc_var = vect_create_destination_var (scalar_dest, NULL);
> +      new_stmt = gimple_build_assign (reduc_var, expr);
> +
> +      if (i == vec_num - 1)
> +       {
> +         SSA_NAME_DEF_STMT (reduc_var) = new_stmt;
> +         /* For chained SLP stmt is the first statement in the group and
> +            gsi points to the last statement in the group.  For non SLP stmt
> +            points to the same location as gsi. In either case tmp_gsi and gsi
> +            should both point to the same insertion point.  */
> +         gcc_assert (scalar_dest_def == gsi_stmt (*gsi));
> +         vect_finish_replace_stmt (scalar_dest_def, new_stmt);
> +       }
> +      else
> +       {
> +         reduc_var = make_ssa_name (reduc_var, new_stmt);
> +         gimple_assign_set_lhs (new_stmt, reduc_var);
> +         vect_finish_stmt_generation (stmt, new_stmt, gsi);
> +       }
> +
> +      if (slp_node)
> +       SLP_TREE_VEC_STMTS (slp_node).quick_push (new_stmt);
> +    }
> +
> +  if (!slp_node)
> +    STMT_VINFO_VEC_STMT (stmt_info) = *vec_stmt = new_stmt;
> +
> +  return true;
> +}
>
>  /* Function is_nonwrapping_integer_induction.
>
> @@ -6063,6 +6234,12 @@ vectorizable_reduction (gimple *stmt, gi
>           return true;
>         }
>
> +      if (STMT_VINFO_REDUC_TYPE (stmt_info) == FOLD_LEFT_REDUCTION)
> +       /* Leave the scalar phi in place.  Note that checking
> +          STMT_VINFO_VEC_REDUCTION_TYPE (as below) only works
> +          for reductions involving a single statement.  */
> +       return true;
> +
>        gimple *reduc_stmt = STMT_VINFO_REDUC_DEF (stmt_info);
>        if (STMT_VINFO_IN_PATTERN_P (vinfo_for_stmt (reduc_stmt)))
>         reduc_stmt = STMT_VINFO_RELATED_STMT (vinfo_for_stmt (reduc_stmt));
> @@ -6289,6 +6466,14 @@ vectorizable_reduction (gimple *stmt, gi
>       directy used in stmt.  */
>    if (reduc_index == -1)
>      {
> +      if (STMT_VINFO_REDUC_TYPE (stmt_info) == FOLD_LEFT_REDUCTION)
> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "in-order reduction chain without SLP.\n");
> +         return false;
> +       }
> +
>        if (orig_stmt)
>         reduc_def_stmt = STMT_VINFO_REDUC_DEF (orig_stmt_info);
>        else
> @@ -6508,7 +6693,9 @@ vectorizable_reduction (gimple *stmt, gi
>
>    vect_reduction_type reduction_type
>      = STMT_VINFO_VEC_REDUCTION_TYPE (stmt_info);
> -  if (orig_stmt && reduction_type == TREE_CODE_REDUCTION)
> +  if (orig_stmt
> +      && (reduction_type == TREE_CODE_REDUCTION
> +         || reduction_type == FOLD_LEFT_REDUCTION))
>      {
>        /* This is a reduction pattern: get the vectype from the type of the
>           reduction variable, and get the tree-code from orig_stmt.  */
> @@ -6555,13 +6742,22 @@ vectorizable_reduction (gimple *stmt, gi
>    epilog_reduc_code = ERROR_MARK;
>
>    if (reduction_type == TREE_CODE_REDUCTION
> +      || reduction_type == FOLD_LEFT_REDUCTION
>        || reduction_type == INTEGER_INDUC_COND_REDUCTION
>        || reduction_type == CONST_COND_REDUCTION)
>      {
> -      if (reduction_code_for_scalar_code (orig_code, &epilog_reduc_code))
> +      bool have_reduc_support;
> +      if (reduction_type == FOLD_LEFT_REDUCTION)
> +       have_reduc_support = fold_left_reduction_code (orig_code, vectype_out,
> +                                                      &epilog_reduc_code);
> +      else
> +       have_reduc_support
> +         = reduction_code_for_scalar_code (orig_code, &epilog_reduc_code);
> +
> +      if (have_reduc_support)
>         {
>           reduc_optab = optab_for_tree_code (epilog_reduc_code, vectype_out,
> -                                         optab_default);
> +                                            optab_default);
>           if (!reduc_optab)
>             {
>               if (dump_enabled_p ())
> @@ -6687,6 +6883,41 @@ vectorizable_reduction (gimple *stmt, gi
>         }
>      }
>
> +  if (double_reduc && reduction_type == FOLD_LEFT_REDUCTION)
> +    {
> +      /* We can't support in-order reductions of code such as this:
> +
> +          for (int i = 0; i < n1; ++i)
> +            for (int j = 0; j < n2; ++j)
> +              l += a[j];
> +
> +        since GCC effectively transforms the loop when vectorizing:
> +
> +          for (int i = 0; i < n1 / VF; ++i)
> +            for (int j = 0; j < n2; ++j)
> +              for (int k = 0; k < VF; ++k)
> +                l += a[j];
> +
> +        which is a reassociation of the original operation.  */
> +      if (dump_enabled_p ())
> +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                        "in-order double reduction not supported.\n");
> +
> +      return false;
> +    }
> +
> +  if (reduction_type == FOLD_LEFT_REDUCTION
> +      && slp_node
> +      && !GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)))
> +    {
> +      /* We cannot use in-order reductions in this case because there is
> +         an implicit reassociation of the operations involved.  */
> +      if (dump_enabled_p ())
> +        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                        "in-order unchained SLP reductions not supported.\n");
> +      return false;
> +    }
> +
>    /* In case of widenning multiplication by a constant, we update the type
>       of the constant to be the type of the other operand.  We check that the
>       constant fits the type in the pattern recognition pass.  */
> @@ -6807,9 +7038,10 @@ vectorizable_reduction (gimple *stmt, gi
>         vect_model_reduction_cost (stmt_info, epilog_reduc_code, ncopies);
>        if (loop_vinfo && LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo))
>         {
> -         if (cond_fn == IFN_LAST
> -             || !direct_internal_fn_supported_p (cond_fn, vectype_in,
> -                                                 OPTIMIZE_FOR_SPEED))
> +         if (reduction_type != FOLD_LEFT_REDUCTION
> +             && (cond_fn == IFN_LAST
> +                 || !direct_internal_fn_supported_p (cond_fn, vectype_in,
> +                                                     OPTIMIZE_FOR_SPEED)))
>             {
>               if (dump_enabled_p ())
>                 dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> @@ -6844,6 +7076,11 @@ vectorizable_reduction (gimple *stmt, gi
>
>    bool masked_loop_p = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo);
>
> +  if (reduction_type == FOLD_LEFT_REDUCTION)
> +    return vectorize_fold_left_reduction
> +      (stmt, gsi, vec_stmt, slp_node, reduc_def_stmt, code,
> +       epilog_reduc_code, ops, vectype_in, reduc_index, masks);
> +
>    if (reduction_type == EXTRACT_LAST_REDUCTION)
>      {
>        gcc_assert (!slp_node);
> Index: gcc/config/aarch64/aarch64.md
> ===================================================================
> --- gcc/config/aarch64/aarch64.md       2017-11-17 16:52:07.246852461 +0000
> +++ gcc/config/aarch64/aarch64.md       2017-11-17 16:52:07.620954871 +0000
> @@ -164,6 +164,7 @@ (define_c_enum "unspec" [
>      UNSPEC_STN
>      UNSPEC_INSR
>      UNSPEC_CLASTB
> +    UNSPEC_FADDA
>  ])
>
>  (define_c_enum "unspecv" [
> Index: gcc/config/aarch64/aarch64-sve.md
> ===================================================================
> --- gcc/config/aarch64/aarch64-sve.md   2017-11-17 16:52:07.246852461 +0000
> +++ gcc/config/aarch64/aarch64-sve.md   2017-11-17 16:52:07.620040195 +0000
> @@ -1574,6 +1574,45 @@ (define_insn "*reduc_<optab>_scal_<mode>
>    "<bit_reduc_op>\t%<Vetype>0, %1, %2.<Vetype>"
>  )
>
> +;; Unpredicated in-order FP reductions.
> +(define_expand "fold_left_plus_<mode>"
> +  [(set (match_operand:<VEL> 0 "register_operand")
> +       (unspec:<VEL> [(match_dup 3)
> +                      (match_operand:<VEL> 1 "register_operand")
> +                      (match_operand:SVE_F 2 "register_operand")]
> +                     UNSPEC_FADDA))]
> +  "TARGET_SVE"
> +  {
> +    operands[3] = force_reg (<VPRED>mode, CONSTM1_RTX (<VPRED>mode));
> +  }
> +)
> +
> +;; In-order FP reductions predicated with PTRUE.
> +(define_insn "*fold_left_plus_<mode>"
> +  [(set (match_operand:<VEL> 0 "register_operand" "=w")
> +       (unspec:<VEL> [(match_operand:<VPRED> 1 "register_operand" "Upl")
> +                      (match_operand:<VEL> 2 "register_operand" "0")
> +                      (match_operand:SVE_F 3 "register_operand" "w")]
> +                     UNSPEC_FADDA))]
> +  "TARGET_SVE"
> +  "fadda\t%<Vetype>0, %1, %<Vetype>0, %3.<Vetype>"
> +)
> +
> +;; Predicated form of the above in-order reduction.
> +(define_insn "*pred_fold_left_plus_<mode>"
> +  [(set (match_operand:<VEL> 0 "register_operand" "=w")
> +       (unspec:<VEL>
> +         [(match_operand:<VEL> 1 "register_operand" "0")
> +          (unspec:SVE_F
> +            [(match_operand:<VPRED> 2 "register_operand" "Upl")
> +             (match_operand:SVE_F 3 "register_operand" "w")
> +             (match_operand:SVE_F 4 "aarch64_simd_imm_zero")]
> +            UNSPEC_SEL)]
> +         UNSPEC_FADDA))]
> +  "TARGET_SVE"
> +  "fadda\t%<Vetype>0, %2, %<Vetype>0, %3.<Vetype>"
> +)
> +
>  ;; Unpredicated floating-point addition.
>  (define_expand "add<mode>3"
>    [(set (match_operand:SVE_F 0 "register_operand")
> Index: gcc/testsuite/lib/target-supports.exp
> ===================================================================
> --- gcc/testsuite/lib/target-supports.exp       2017-11-17 16:52:07.246852461 
> +0000
> +++ gcc/testsuite/lib/target-supports.exp       2017-11-17 16:52:07.627357602 
> +0000
> @@ -7180,6 +7180,12 @@ proc check_effective_target_vect_fold_ex
>      return [check_effective_target_aarch64_sve]
>  }
>
> +# Return 1 if the target supports the fold_left_plus optab.
> +
> +proc check_effective_target_vect_fold_left_plus { } {
> +    return [check_effective_target_aarch64_sve]
> +}
> +
>  # Return 1 if the target supports section-anchors
>
>  proc check_effective_target_section_anchors { } {
> Index: gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c     2017-11-17 
> 16:52:07.246852461 +0000
> +++ gcc/testsuite/gcc.dg/vect/no-fast-math-vect16.c     2017-11-17 
> 16:52:07.625528250 +0000
> @@ -34,4 +34,4 @@ int main (void)
>  }
>
>  /* Requires fast-math.  */
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail *-*-* } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { xfail { ! vect_fold_left_plus } } } } */
> Index: gcc/testsuite/gcc.dg/vect/pr79920.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/vect/pr79920.c 2017-11-17 16:52:07.246852461 +0000
> +++ gcc/testsuite/gcc.dg/vect/pr79920.c 2017-11-17 16:52:07.625528250 +0000
> @@ -41,4 +41,5 @@ int main()
>    return 0;
>  }
>
> -/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_double && { vect_perm && vect_hw_misalign } } } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { { vect_double && { ! vect_fold_left_plus } } && { vect_perm && vect_hw_misalign } } } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { { vect_double && vect_fold_left_plus } && { vect_perm && vect_hw_misalign } } } } } */
> Index: gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c      2017-11-17 
> 16:52:07.246852461 +0000
> +++ gcc/testsuite/gcc.dg/vect/trapv-vect-reduc-4.c      2017-11-17 
> 16:52:07.625528250 +0000
> @@ -46,5 +46,9 @@ int main (void)
>    return 0;
>  }
>
> -/* { dg-final { scan-tree-dump-times "Detected reduction\\." 2 "vect"  } } */
> +/* 2 for the first loop.  */
> +/* { dg-final { scan-tree-dump-times "Detected reduction\\." 3 "vect" { target { ! vect_multiple_sizes } } } } */
> +/* { dg-final { scan-tree-dump "Detected reduction\\." "vect" { target vect_multiple_sizes } } } */
> +/* { dg-final { scan-tree-dump-times "not vectorized" 1 "vect" { target { ! vect_multiple_sizes } } } } */
> +/* { dg-final { scan-tree-dump "not vectorized" "vect" { target vect_multiple_sizes } } } */
>  /* { dg-final { scan-tree-dump-times "vectorized 2 loops" 1 "vect" { target { ! vect_no_int_min_max } } } } */
> Index: gcc/testsuite/gcc.dg/vect/vect-reduc-6.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/vect/vect-reduc-6.c    2017-11-17 16:52:07.246852461 
> +0000
> +++ gcc/testsuite/gcc.dg/vect/vect-reduc-6.c    2017-11-17 16:52:07.625528250 
> +0000
> @@ -50,4 +50,5 @@ int main (void)
>
>  /* need -ffast-math to vectorizer these loops.  */
>  /* ARM NEON passes -ffast-math to these tests, so expect this to fail.  */
> -/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { xfail arm_neon_ok } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! vect_fold_left_plus } xfail arm_neon_ok } } } */
> +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target vect_fold_left_plus } } } */
> Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1.c
> ===================================================================
> --- /dev/null   2017-11-14 14:28:07.424493901 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1.c       2017-11-17 
> 16:52:07.625528250 +0000
> @@ -0,0 +1,28 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
> +
> +#define NUM_ELEMS(TYPE) ((int)(5 * (256 / sizeof (TYPE)) + 3))
> +
> +#define DEF_REDUC_PLUS(TYPE)                   \
> +  TYPE __attribute__ ((noinline, noclone))     \
> +  reduc_plus_##TYPE (TYPE *a, TYPE *b)         \
> +  {                                            \
> +    TYPE r = 0, q = 3;                         \
> +    for (int i = 0; i < NUM_ELEMS(TYPE); i++)  \
> +      {                                                \
> +       r += a[i];                              \
> +       q -= b[i];                              \
> +      }                                                \
> +    return r * q;                              \
> +  }
> +
> +#define TEST_ALL(T) \
> +  T (_Float16) \
> +  T (float) \
> +  T (double)
> +
> +TEST_ALL (DEF_REDUC_PLUS)
> +
> +/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 2 } } */
> +/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 2 } } */
> +/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 2 } } */
> Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1_run.c
> ===================================================================
> --- /dev/null   2017-11-14 14:28:07.424493901 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_1_run.c   2017-11-17 
> 16:52:07.625528250 +0000
> @@ -0,0 +1,29 @@
> +/* { dg-do run { target { aarch64_sve_hw } } } */
> +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
> +
> +#include "sve_reduc_strict_1.c"
> +
> +#define TEST_REDUC_PLUS(TYPE)                  \
> +  {                                            \
> +    TYPE a[NUM_ELEMS (TYPE)];                  \
> +    TYPE b[NUM_ELEMS (TYPE)];                  \
> +    TYPE r = 0, q = 3;                         \
> +    for (int i = 0; i < NUM_ELEMS (TYPE); i++) \
> +      {                                                \
> +       a[i] = (i * 0.1) * (i & 1 ? 1 : -1);    \
> +       b[i] = (i * 0.3) * (i & 1 ? 1 : -1);    \
> +       r += a[i];                              \
> +       q -= b[i];                              \
> +       asm volatile ("" ::: "memory");         \
> +      }                                                \
> +    TYPE res = reduc_plus_##TYPE (a, b);       \
> +    if (res != r * q)                          \
> +      __builtin_abort ();                      \
> +  }
> +
> +int __attribute__ ((optimize (1)))
> +main ()
> +{
> +  TEST_ALL (TEST_REDUC_PLUS);
> +  return 0;
> +}
> Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2.c
> ===================================================================
> --- /dev/null   2017-11-14 14:28:07.424493901 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2.c       2017-11-17 
> 16:52:07.625528250 +0000
> @@ -0,0 +1,28 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve" } */
> +
> +#define NUM_ELEMS(TYPE) ((int) (5 * (256 / sizeof (TYPE)) + 3))
> +
> +#define DEF_REDUC_PLUS(TYPE)                                   \
> +void __attribute__ ((noinline, noclone))                       \
> +reduc_plus_##TYPE (TYPE (*restrict a)[NUM_ELEMS(TYPE)],                \
> +                  TYPE *restrict r, int n)                     \
> +{                                                              \
> +  for (int i = 0; i < n; i++)                                  \
> +    {                                                          \
> +      r[i] = 0;                                                        \
> +      for (int j = 0; j < NUM_ELEMS(TYPE); j++)                        \
> +        r[i] += a[i][j];                                       \
> +    }                                                          \
> +}
> +
> +#define TEST_ALL(T) \
> +  T (_Float16) \
> +  T (float) \
> +  T (double)
> +
> +TEST_ALL (DEF_REDUC_PLUS)
> +
> +/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h} 1 } } */
> +/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 1 } } */
> +/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 1 } } */
> Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2_run.c
> ===================================================================
> --- /dev/null   2017-11-14 14:28:07.424493901 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_2_run.c   2017-11-17 
> 16:52:07.626442926 +0000
> @@ -0,0 +1,31 @@
> +/* { dg-do run { target { aarch64_sve_hw } } } */
> +/* { dg-options "-O2 -ftree-vectorize -fno-inline -march=armv8-a+sve" } */
> +
> +#include "sve_reduc_strict_2.c"
> +
> +#define NROWS 5
> +
> +#define TEST_REDUC_PLUS(TYPE)                                  \
> +  {                                                            \
> +    TYPE a[NROWS][NUM_ELEMS (TYPE)];                           \
> +    TYPE r[NROWS];                                             \
> +    TYPE expected[NROWS] = {};                                 \
> +    for (int i = 0; i < NROWS; ++i)                            \
> +      for (int j = 0; j < NUM_ELEMS (TYPE); ++j)               \
> +       {                                                       \
> +         a[i][j] = (i * 0.1 + j * 0.6) * (j & 1 ? 1 : -1);     \
> +         expected[i] += a[i][j];                               \
> +         asm volatile ("" ::: "memory");                       \
> +       }                                                       \
> +    reduc_plus_##TYPE (a, r, NROWS);                           \
> +    for (int i = 0; i < NROWS; ++i)                            \
> +      if (r[i] != expected[i])                                 \
> +       __builtin_abort ();                                     \
> +  }
> +
> +int __attribute__ ((optimize (1)))
> +main ()
> +{
> +  TEST_ALL (TEST_REDUC_PLUS);
> +  return 0;
> +}
> Index: gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_3.c
> ===================================================================
> --- /dev/null   2017-11-14 14:28:07.424493901 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve_reduc_strict_3.c       2017-11-17 
> 16:52:07.626442926 +0000
> @@ -0,0 +1,131 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ftree-vectorize -fno-inline -march=armv8-a+sve -msve-vector-bits=256 -fdump-tree-vect-details" } */
> +
> +double mat[100][4];
> +double mat2[100][8];
> +double mat3[100][12];
> +double mat4[100][3];
> +
> +double
> +slp_reduc_plus (int n)
> +{
> +  double tmp = 0.0;
> +  for (int i = 0; i < n; i++)
> +    {
> +      tmp = tmp + mat[i][0];
> +      tmp = tmp + mat[i][1];
> +      tmp = tmp + mat[i][2];
> +      tmp = tmp + mat[i][3];
> +    }
> +  return tmp;
> +}
> +
> +double
> +slp_reduc_plus2 (int n)
> +{
> +  double tmp = 0.0;
> +  for (int i = 0; i < n; i++)
> +    {
> +      tmp = tmp + mat2[i][0];
> +      tmp = tmp + mat2[i][1];
> +      tmp = tmp + mat2[i][2];
> +      tmp = tmp + mat2[i][3];
> +      tmp = tmp + mat2[i][4];
> +      tmp = tmp + mat2[i][5];
> +      tmp = tmp + mat2[i][6];
> +      tmp = tmp + mat2[i][7];
> +    }
> +  return tmp;
> +}
> +
> +double
> +slp_reduc_plus3 (int n)
> +{
> +  double tmp = 0.0;
> +  for (int i = 0; i < n; i++)
> +    {
> +      tmp = tmp + mat3[i][0];
> +      tmp = tmp + mat3[i][1];
> +      tmp = tmp + mat3[i][2];
> +      tmp = tmp + mat3[i][3];
> +      tmp = tmp + mat3[i][4];
> +      tmp = tmp + mat3[i][5];
> +      tmp = tmp + mat3[i][6];
> +      tmp = tmp + mat3[i][7];
> +      tmp = tmp + mat3[i][8];
> +      tmp = tmp + mat3[i][9];
> +      tmp = tmp + mat3[i][10];
> +      tmp = tmp + mat3[i][11];
> +    }
> +  return tmp;
> +}
> +
> +void
> +slp_non_chained_reduc (int n, double * restrict out)
> +{
> +  for (int i = 0; i < 3; i++)
> +    out[i] = 0;
> +
> +  for (int i = 0; i < n; i++)
> +    {
> +      out[0] = out[0] + mat4[i][0];
> +      out[1] = out[1] + mat4[i][1];
> +      out[2] = out[2] + mat4[i][2];
> +    }
> +}
> +
> +/* Strict FP reductions shouldn't be used for the outer loops, only the
> +   inner loops.  */
> +
> +float
> +double_reduc1 (float (*restrict i)[16])
> +{
> +  float l = 0;
> +
> +  for (int a = 0; a < 8; a++)
> +    for (int b = 0; b < 8; b++)
> +      l += i[b][a];
> +  return l;
> +}
> +
> +float
> +double_reduc2 (float *restrict i)
> +{
> +  float l = 0;
> +
> +  for (int a = 0; a < 8; a++)
> +    for (int b = 0; b < 16; b++)
> +      {
> +        l += i[b * 4];
> +        l += i[b * 4 + 1];
> +        l += i[b * 4 + 2];
> +        l += i[b * 4 + 3];
> +      }
> +  return l;
> +}
> +
> +float
> +double_reduc3 (float *restrict i, float *restrict j)
> +{
> +  float k = 0, l = 0;
> +
> +  for (int a = 0; a < 8; a++)
> +    for (int b = 0; b < 8; b++)
> +      {
> +        k += i[b];
> +        l += j[b];
> +      }
> +  return l * k;
> +}
> +
> +/* We can't yet handle double_reduc1.  */
> +/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s} 3 } } */
> +/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d} 9 } } */
> +/* 1 reduction each for double_reduc{1,2} and 2 for double_reduc3.  Each one
> +   is reported three times, once for SVE, once for 128-bit AdvSIMD and once
> +   for 64-bit AdvSIMD.  */
> +/* { dg-final { scan-tree-dump-times "Detected double reduction" 12 "vect" } } */
> +/* double_reduc2 has 2 reductions and slp_non_chained_reduc has 3.
> +   double_reduc1 is reported 3 times (SVE, 128-bit AdvSIMD, 64-bit AdvSIMD)
> +   before failing.  */
> +/* { dg-final { scan-tree-dump-times "Detected reduction" 12 "vect" } } */
> Index: gcc/testsuite/gcc.target/aarch64/sve_slp_13.c
> ===================================================================
> --- gcc/testsuite/gcc.target/aarch64/sve_slp_13.c       2017-11-17 
> 16:52:07.246852461 +0000
> +++ gcc/testsuite/gcc.target/aarch64/sve_slp_13.c       2017-11-17 
> 16:52:07.626442926 +0000
> @@ -1,5 +1,6 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable" } */
> +/* The cost model thinks that the double loop isn't a win for SVE-128.  */
> +/* { dg-options "-O2 -ftree-vectorize -march=armv8-a+sve -msve-vector-bits=scalable -fno-vect-cost-model" } */
>
>  #include <stdint.h>
>
> @@ -24,7 +25,10 @@ #define TEST_ALL(T)                          \
>    T (int32_t)                                  \
>    T (uint32_t)                                 \
>    T (int64_t)                                  \
> -  T (uint64_t)
> +  T (uint64_t)                                 \
> +  T (_Float16)                                 \
> +  T (float)                                    \
> +  T (double)
>
>  TEST_ALL (VEC_PERM)
>
> @@ -32,21 +36,25 @@ TEST_ALL (VEC_PERM)
>  /* ??? We don't treat the uint loops as SLP.  */
>  /* The loop should be fully-masked.  */
>  /* { dg-final { scan-assembler-times {\tld1b\t} 2 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tld1h\t} 2 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tld1w\t} 2 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tld1w\t} 1 } } */
> -/* { dg-final { scan-assembler-times {\tld1d\t} 2 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tld1d\t} 1 } } */
> +/* { dg-final { scan-assembler-times {\tld1h\t} 3 { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-times {\tld1w\t} 3 { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-times {\tld1w\t} 2 } } */
> +/* { dg-final { scan-assembler-times {\tld1d\t} 3 { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-times {\tld1d\t} 2 } } */
>  /* { dg-final { scan-assembler-not {\tldr} { xfail *-*-* } } } */
>
>  /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b} 4 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 4 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 4 } } */
> -/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 4 } } */
> +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h} 6 { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
> +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
>
>  /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.b\n} 2 { xfail *-*-* } } } */
>  /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.h\n} 2 { xfail *-*-* } } } */
>  /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.s\n} 2 } } */
>  /* { dg-final { scan-assembler-times {\tuaddv\td[0-9]+, p[0-7], z[0-9]+\.d\n} 2 } } */
> +/* { dg-final { scan-assembler-times {\tfadda\th[0-9]+, p[0-7], h[0-9]+, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tfadda\ts[0-9]+, p[0-7], s[0-9]+, z[0-9]+\.s\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tfadda\td[0-9]+, p[0-7], d[0-9]+, z[0-9]+\.d\n} 1 } } */
> +/* { dg-final { scan-assembler-not {\tfadd\n} } } */
>
>  /* { dg-final { scan-assembler-not {\tuqdec} } } */
> Index: gcc/testsuite/gfortran.dg/vect/vect-8.f90
> ===================================================================
> --- gcc/testsuite/gfortran.dg/vect/vect-8.f90   2017-11-17 16:52:07.246852461 
> +0000
> +++ gcc/testsuite/gfortran.dg/vect/vect-8.f90   2017-11-17 16:52:07.626442926 
> +0000
> @@ -704,5 +704,6 @@ CALL track('KERNEL  ')
>  RETURN
>  END SUBROUTINE kernel
>
> -! { dg-final { scan-tree-dump-times "vectorized 21 loops" 1 "vect" { target { vect_intdouble_cvt } } } }
>  ! { dg-final { scan-tree-dump-times "vectorized 17 loops" 1 "vect" { target { ! vect_intdouble_cvt } } } }
> +! { dg-final { scan-tree-dump-times "vectorized 21 loops" 1 "vect" { target { vect_intdouble_cvt && { ! vect_fold_left_plus } } } } }
> +! { dg-final { scan-tree-dump-times "vectorized 25 loops" 1 "vect" { target { vect_intdouble_cvt && vect_fold_left_plus } } } }
