On Fri, Feb 18, 2022 at 7:20 PM Roger Sayle <ro...@nextmovesoftware.com> wrote: > > > This patch improves GCC's scalar evolution and final value replacement > optimizations by supporting triangular/quadratic/trapezoid chrecs which > resolves both PR middle-end/65855 and PR c/80852, but alas not (yet) > PR tree-optimization/46186. > > I've listed Richard Biener as co-author, as this solution is based > heavily on his proposed patch in comment #4 of PR 65855 from 2015, > but with several significant changes. The most important change is > a correction to formula used. For the scalar evolution {a, +, {b, +, c}}, > there was an off-by-one error, so chrec_apply should not return > a + b*x + c*x*(x+1)/2, but a + b*x + c*x*(x-1)/2, which explains > why the original patch didn't produce the expected results. > > Another significant set of changes involves handling the various > type mismatches and overflows. In chrec_apply the evolution is > evaluated after x iterations (which is an unsigned integer type, > called orig_type in this patch) but a, b and c may be any type > including floating point (see PR tree-opt/23391) and including > types that trap on overflow with -ftrapv (see PR tree-opt/79721), > and in the case of pointer arithmetic, b and c may have a > different type (sizetype) to a! Additionally, the division by > two in "c*x*(x-1)/2" requires the correct top bit in modulo > arithmetic, which means that the multiplication typically needs > to be performed with more precision (in a wider mode) than > orig_type [unless c is an even integer constant, or x (the > number of iterations) is a known constant at compile-time]. > > Finally, final value replacement is only an effective optimization > if the expression used to compute the result of the loop is cheaper > than the loop itself, hence chrec_apply needs to return an optimized > folded tree with the minimal number of operators. Hence if b == c, > this patch calculates "a + c*x*(x+1)/2", when c is even we can > perform the division at compile-time, and when x is a non-trivial > expression, we wrap it in a SAVE_EXPR so that the lowering to > gimple can reuse the common subexpression. > > Correctly handling all of the corner cases results in a patch > significantly larger than the original "proof-of-concept". > There's one remaining refinement, marked as TODO in the patch, > which is to also support unsigned 64-bit to 128-bit widening > multiplications (umulditi3) on targets that support it, such as > x86_64, as used by LLVM, but this patch provides the core > target-independent functionality. [This last piece would > handle/resolve the testcase in PR tree-opt/46186 which > requires 128-bit TImode operations, not suitable for all > backend targets]. > > This patch has been tested on x86_64-pc-linux-gnu with make bootstrap > and make -k check with no new failures. Ok for mainline?
+/* Fold the (exact) division of chrec by two. */ + +static tree +chrec_fold_divide_by_2 (tree type, tree op) +{ + if (SCALAR_FLOAT_TYPE_P (type)) + return fold_build2 (MULT_EXPR, type, op, build_real (type, dconsthalf)); + return fold_build2 (RSHIFT_EXPR, type, op, build_int_cst (type, 1)); any reason to not use EXACT_DIV_EXPR? +/* Indicate that a chrec subexpression is used multiple times. */ + +static tree +chrec_save_expr (tree t) +{ + if (is_gimple_min_invariant (t)) + return t; + t = save_expr (t); + if (TREE_CODE (t) == SAVE_EXPR) + TREE_SIDE_EFFECTS (t) = 0; + return t; why clear TREE_SIDE_EFFECTS here? IIRC SCEV analysis simply re-uses subtrees (thus generates a graph) without wrapping the shared leafs in SAVE_EXPR (and then have unshare_expr before gimplification blow it up of course...). + /* Determine whether c is even. */ + tree half_c = NULL_TREE; + if (TREE_CODE (c) == INTEGER_CST + && (wi::to_widest (c) & 1) == 0) wi::to_wide should work as well and is cheaper + /* Try to narrow the original type, by stripping zero extensions. */ + tree orig_type = TREE_TYPE (x); + if (TREE_CODE (x) == NOP_EXPR) + { so we do have get_narrower (), can we use that? + tree wide_type; + /* Prefer to perform multiplications in type CTYPE. */ + if (INTEGRAL_TYPE_P (ctype) + && TYPE_UNSIGNED (ctype) + && TYPE_PRECISION (ctype) > prec) + wide_type = ctype; + else if (TYPE_PRECISION (unsigned_type_node) > prec) + wide_type = unsigned_type_node; + else if (TYPE_PRECISION (long_unsigned_type_node) > prec) + wide_type = long_unsigned_type_node; + else if (TYPE_PRECISION (long_long_unsigned_type_node) > prec) + wide_type = long_long_unsigned_type_node; + else + /* TODO: Try TImode on targets that support umul_widen_optab. */ + return chrec_dont_know; can we use mode_for_size () and type_for_mode () instead of hard-coding these? You can look for a umul_optab if we do not want to risk libgcc calls. There's some code that looks like it can be factored. Maybe my eyes are mis-matching things of course: + res = chrec_fold_plus (wide_type, save_x, one); + res = chrec_fold_multiply (wide_type, res, save_x); + if (half_c) + { + res = chrec_convert (ctype, res, NULL); + res = chrec_fold_multiply (ctype, half_c, res); + } + else + { + res = chrec_fold_divide_by_2 (wide_type, res); + res = chrec_convert (ctype, res, NULL); + res = chrec_fold_multiply (ctype, b, res); + } + /* + a */ + if (ctype != type) + res = chrec_convert (type, res, NULL); + res = chrec_fold_plus (TREE_TYPE (a), a, res); + return res; looks like a common tail with one optional + b*x? @@ -3469,6 +3472,19 @@ expression_expensive_p (tree expr, hash_map<tree, uint64_t> &cache, cost += op_cost + 1; return false; + case tcc_expression: + if (code == SAVE_EXPR) + { + if (expression_expensive_p (TREE_OPERAND (expr, 0), cache, op_cost)) + return true; + /* Assume SAVE_EXPR is instanced twice. */ + op_cost /= 2; + *cache.get (expr) += op_cost; + cost += op_cost; + return false; + } + return true; + default: return true; } I think expression_expensive_p assumes we unshare_expr () (correctly, because we do...) and thus counts each use of a multi-use. With SAVE_EXPRs it should instead have operated in a walk_tree_without_duplicates mode. Maybe you should bite the bullet and wrap expression_expensive_p, allocating a pointer-set to simply not visit SAVE_EXPRs multiple times (and not caching it in 'cache')? I've added the cache explicitely to avoid expression_expensive_p to be exponential but make it compute that unshare_expr will be. Maybe the SAVE_EXPR behavior can be more precisely handled? Looks like stage1 material btw. Thanks, Richard. > > 2022-02-18 Roger Sayle <ro...@nextmovesoftware.com> > Richard Biener <rguent...@suse.de> > > gcc/ChangeLog > PR target/65855 > PR c/80852 > * tree-chrec.cc (chrec_fold_divide_by_2): New function to > divide a chrec by two, honoring the type of the chrec. > (chrec_save_expr): New function to wrap a chrec in a > SAVE_EXPR, but without TREE_SIDE_EFFECTS. > (chrec_apply_triangular): New helper function of chrec > apply to evaluate a quadratic/triangular chrec. > (chrec_apply): Expand/clarify comment before function. > Handle application of any chrec after zero iterations, i.e. A. > Recursively handle cases where the iteration count is > conditional. Handle quadratic/triangular chrecs by calling > the new chrec_apply_triangular function. > (chrec_convert_rhs): Handle conversion of integer constants > to scalar floating point types here (moved from chrec_apply). > * tree-scalar-evolution.cc (interpret_expr): Handle SAVE_EXPR > by (tail call) recursion. > (expression_expensive_p): Calculate the cost of a SAVE_EXPR > as half the cost of its operand, i.e. assume it is reused. > > gcc/testsuite/ChangeLog > PR target/65855 > PR c/80852 > * gcc.dg/tree-ssa/pr65855.c: New test case. > * gcc.dg/tree-ssa/pr80852.c: New test case. > * gcc.dg/vect/vect-iv-11.c: Update to reflect that the loop is > no longer vectorized, but calculated by final value replacement. > > Roger > -- >