On Fri, Feb 18, 2022 at 7:20 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
>
> This patch improves GCC's scalar evolution and final value replacement
> optimizations by supporting triangular/quadratic/trapezoid chrecs which
> resolves both PR middle-end/65855 and PR c/80852, but alas not (yet)
> PR tree-optimization/46186.
>
> I've listed Richard Biener as co-author, as this solution is based
> heavily on his proposed patch in comment #4 of PR 65855 from 2015,
> but with several significant changes.  The most important change is
> a correction to formula used. For the scalar evolution {a, +, {b, +, c}},
> there was an off-by-one error, so chrec_apply should not return
> a + b*x + c*x*(x+1)/2, but a + b*x + c*x*(x-1)/2, which explains
> why the original patch didn't produce the expected results.
>
> Another significant set of changes involves handling the various
> type mismatches and overflows.  In chrec_apply the evolution is
> evaluated after x iterations (which is an unsigned integer type,
> called orig_type in this patch) but a, b and c may be any type
> including floating point (see PR tree-opt/23391) and including
> types that trap on overflow with -ftrapv (see PR tree-opt/79721),
> and in the case of pointer arithmetic, b and c may have a
> different type (sizetype) to a!  Additionally, the division by
> two in "c*x*(x-1)/2" requires the correct top bit in modulo
> arithmetic, which means that the multiplication typically needs
> to be performed with more precision (in a wider mode) than
> orig_type [unless c is an even integer constant, or x (the
> number of iterations) is a known constant at compile-time].
>
> Finally, final value replacement is only an effective optimization
> if the expression used to compute the result of the loop is cheaper
> than the loop itself, hence chrec_apply needs to return an optimized
> folded tree with the minimal number of operators.  Hence if b == c,
> this patch calculates "a + c*x*(x+1)/2", when c is even we can
> perform the division at compile-time, and when x is a non-trivial
> expression, we wrap it in a SAVE_EXPR so that the lowering to
> gimple can reuse the common subexpression.
>
> Correctly handling all of the corner cases results in a patch
> significantly larger than the original "proof-of-concept".
> There's one remaining refinement, marked as TODO in the patch,
> which is to also support unsigned 64-bit to 128-bit widening
> multiplications (umulditi3) on targets that support it, such as
> x86_64, as used by LLVM, but this patch provides the core
> target-independent functionality.  [This last piece would
> handle/resolve the testcase in PR tree-opt/46186 which
> requires 128-bit TImode operations, not suitable for all
> backend targets].
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check with no new failures.  Ok for mainline?

+/* Fold the (exact) division of chrec by two.  */
+
+static tree
+chrec_fold_divide_by_2 (tree type, tree op)
+{
+  if (SCALAR_FLOAT_TYPE_P (type))
+    return fold_build2 (MULT_EXPR, type, op, build_real (type, dconsthalf));
+  return fold_build2 (RSHIFT_EXPR, type, op, build_int_cst (type, 1));

any reason to not use EXACT_DIV_EXPR?

+/* Indicate that a chrec subexpression is used multiple times.  */
+
+static tree
+chrec_save_expr (tree t)
+{
+  if (is_gimple_min_invariant (t))
+    return t;
+  t = save_expr (t);
+  if (TREE_CODE (t) == SAVE_EXPR)
+    TREE_SIDE_EFFECTS (t) = 0;
+  return t;

why clear TREE_SIDE_EFFECTS here?  IIRC SCEV analysis
simply re-uses subtrees (thus generates a graph) without
wrapping the shared leafs in SAVE_EXPR (and then have
unshare_expr before gimplification blow it up of course...).

+  /* Determine whether c is even.  */
+  tree half_c = NULL_TREE;
+  if (TREE_CODE (c) == INTEGER_CST
+      && (wi::to_widest (c) & 1) == 0)

wi::to_wide should work as well and is cheaper

+  /* Try to narrow the original type, by stripping zero extensions.  */
+  tree orig_type = TREE_TYPE (x);
+  if (TREE_CODE (x) == NOP_EXPR)
+    {

so we do have get_narrower (), can we use that?

+  tree wide_type;
+  /* Prefer to perform multiplications in type CTYPE.  */
+  if (INTEGRAL_TYPE_P (ctype)
+      && TYPE_UNSIGNED (ctype)
+      && TYPE_PRECISION (ctype) > prec)
+    wide_type = ctype;
+  else if (TYPE_PRECISION (unsigned_type_node) > prec)
+    wide_type = unsigned_type_node;
+  else if (TYPE_PRECISION (long_unsigned_type_node) > prec)
+    wide_type = long_unsigned_type_node;
+  else if (TYPE_PRECISION (long_long_unsigned_type_node) > prec)
+    wide_type = long_long_unsigned_type_node;
+  else
+    /* TODO: Try TImode on targets that support umul_widen_optab.  */
+    return chrec_dont_know;

can we use mode_for_size () and type_for_mode () instead of hard-coding
these?  You can look for a umul_optab if we do not want to risk
libgcc calls.

There's some code that looks like it can be factored.  Maybe my eyes
are mis-matching things of course:

+      res = chrec_fold_plus (wide_type, save_x, one);
+      res = chrec_fold_multiply (wide_type, res, save_x);
+      if (half_c)
+       {
+         res = chrec_convert (ctype, res, NULL);
+         res = chrec_fold_multiply (ctype, half_c, res);
+       }
+      else
+       {
+         res = chrec_fold_divide_by_2 (wide_type, res);
+         res = chrec_convert (ctype, res, NULL);
+         res = chrec_fold_multiply (ctype, b, res);
+       }
+      /* + a  */
+      if (ctype != type)
+       res = chrec_convert (type, res, NULL);
+      res = chrec_fold_plus (TREE_TYPE (a), a, res);
+      return res;

looks like a common tail with one optional + b*x?

@@ -3469,6 +3472,19 @@ expression_expensive_p (tree expr,
hash_map<tree, uint64_t> &cache,
       cost += op_cost + 1;
       return false;

+    case tcc_expression:
+      if (code == SAVE_EXPR)
+       {
+         if (expression_expensive_p (TREE_OPERAND (expr, 0), cache, op_cost))
+           return true;
+         /* Assume SAVE_EXPR is instanced twice.  */
+         op_cost /= 2;
+         *cache.get (expr) += op_cost;
+         cost += op_cost;
+         return false;
+       }
+      return true;
+
     default:
       return true;
     }

I think expression_expensive_p assumes we unshare_expr () (correctly,
because we do...) and thus
counts each use of a multi-use.  With SAVE_EXPRs it should instead
have operated in a
walk_tree_without_duplicates mode.  Maybe you should bite the bullet
and wrap expression_expensive_p,
allocating a pointer-set to simply not visit SAVE_EXPRs multiple times
(and not caching it in
'cache')?  I've added the cache explicitely to avoid
expression_expensive_p to be exponential
but make it compute that unshare_expr will be.  Maybe the SAVE_EXPR
behavior can be more
precisely handled?

Looks like stage1 material btw.

Thanks,
Richard.


>
> 2022-02-18  Roger Sayle  <ro...@nextmovesoftware.com>
>             Richard Biener  <rguent...@suse.de>
>
> gcc/ChangeLog
>         PR target/65855
>         PR c/80852
>         * tree-chrec.cc (chrec_fold_divide_by_2): New function to
>         divide a chrec by two, honoring the type of the chrec.
>         (chrec_save_expr): New function to wrap a chrec in a
>         SAVE_EXPR, but without TREE_SIDE_EFFECTS.
>         (chrec_apply_triangular): New helper function of chrec
>         apply to evaluate a quadratic/triangular chrec.
>         (chrec_apply): Expand/clarify comment before function.
>         Handle application of any chrec after zero iterations, i.e. A.
>         Recursively handle cases where the iteration count is
>         conditional.  Handle quadratic/triangular chrecs by calling
>         the new chrec_apply_triangular function.
>         (chrec_convert_rhs): Handle conversion of integer constants
>         to scalar floating point types here (moved from chrec_apply).
>         * tree-scalar-evolution.cc (interpret_expr): Handle SAVE_EXPR
>         by (tail call) recursion.
>         (expression_expensive_p): Calculate the cost of a SAVE_EXPR
>         as half the cost of its operand, i.e. assume it is reused.
>
> gcc/testsuite/ChangeLog
>         PR target/65855
>         PR c/80852
>         * gcc.dg/tree-ssa/pr65855.c: New test case.
>         * gcc.dg/tree-ssa/pr80852.c: New test case.
>         * gcc.dg/vect/vect-iv-11.c: Update to reflect that the loop is
>         no longer vectorized, but calculated by final value replacement.
>
> Roger
> --
>

Reply via email to