[Bug target/87561] [9 Regression] 416.gamess is slower by ~10% starting from r264866 with -Ofast

rguenth at gcc dot gnu.org Thu, 11 Oct 2018 03:09:01 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87561


--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to rsand...@gcc.gnu.org from comment #5)
> (In reply to Richard Biener from comment #4)
> > Another thing is the too complicated alias check where for
> > 
> > (gdb) p debug_data_reference (dr_a.dr)
> > #(Data Ref: 
> > #  bb: 14 
> > #  stmt: _28 = *xpqkl_172(D)[_27];
> > #  ref: *xpqkl_172(D)[_27];
> > #  base_object: *xpqkl_172(D);
> > #  Access function 0: {(((integer(kind=8)) mkl_203 + 1) * stride.33_148 +
> > offset.34_149) + _480, +, stride.33_148}_6
> > #)
> > $9 = void
> > (gdb) p debug_data_reference (dr_b.dr)
> > #(Data Ref: 
> > #  bb: 14 
> > #  stmt: *xpqkl_172(D)[_50] = _65;
> > #  ref: *xpqkl_172(D)[_50];
> > #  base_object: *xpqkl_172(D);
> > #  Access function 0: {(((integer(kind=8)) mkl_203 + 1) * stride.33_148 +
> > offset.34_149) + _486, +, stride.33_148}_6
> > #)
> > 
> > we generate
> > 
> > (ssizetype) (((sizetype) ((((integer(kind=8)) mkl_203 + 1) * stride.33_148 +
> > offset.34_149) + (integer(kind=8)) (_19 + jpack_161)) + (sizetype)
> > stride.33_148) * 8) < (ssizetype) ((sizetype) ((((integer(kind=8)) mkl_203 +
> > 1) * stride.33_148 + offset.34_149) + (integer(kind=8)) (_22 + lpack_164)) *
> > 8) || (ssizetype) (((sizetype) ((((integer(kind=8)) mkl_203 + 1) *
> > stride.33_148 + offset.34_149) + (integer(kind=8)) (_22 + lpack_164)) +
> > (sizetype) stride.33_148) * 8) < (ssizetype) ((sizetype)
> > ((((integer(kind=8)) mkl_203 + 1) * stride.33_148 + offset.34_149) +
> > (integer(kind=8)) (_19 + jpack_161)) * 8)
> > 
> > instead of simply _480 != _486 (well, OK, not _that_ simple).
> > 
> > I guess we miss many of the "optimizations" we do when dealing with
> > alias checks for constant steps.  In this case sth obvious would be
> > to special-case DR_STEP (dra) == DR_STEP (drb).  Richard?
> Not sure that would help much with the existing optimisations.
> I think the closest we get is create_intersect_range_checks_index,
> but "all" that avoids is scaling the index by the element size
> and adding the common base.  I guess the expensive bit here is
> multiplying by the stride, but the index-based check would still
> do that.
> 
> That said, create_intersect_range_checks_index does feel like it
> might be a bit *too* conservative (but I'm not brave enough to relax it)

One thing I notice above is that we do

 (ssizetype) ((sizetype)X * 8) < (ssizetype) ((sizetype)Y * 8)

that is, we do a signed comparison but do the multiplication in a type
that allows wrapping.  I suppose this is an artifact of using
DR_OFFSET and friends.
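
To illustrate why that form cannot simply be folded to X < Y, here is a
minimal standalone sketch (not GCC code). The typedefs are stand-ins I am
assuming for sizetype/ssizetype on a 64-bit target, and the helper names
are made up for the example. It also relies on the unsigned-to-signed
conversion wrapping, which is implementation-defined in C but is what GCC
does.

```c
#include <stdint.h>

/* Hypothetical stand-ins for GCC's sizetype / ssizetype (64-bit target).  */
typedef uint64_t sizetype_t;
typedef int64_t ssizetype_t;

/* Mirrors (ssizetype) ((sizetype) X * 8) < (ssizetype) ((sizetype) Y * 8):
   the multiplication is done in a wrapping unsigned type, the comparison
   in the signed type.  */
static int scaled_less (int64_t x, int64_t y)
{
  sizetype_t sx = (sizetype_t) x * 8;   /* may wrap modulo 2^64 */
  sizetype_t sy = (sizetype_t) y * 8;
  return (ssizetype_t) sx < (ssizetype_t) sy;
}

/* The comparison one would like to see instead.  */
static int plain_less (int64_t x, int64_t y)
{
  return x < y;
}
```

For small values both agree, but with X = 2^60 and Y = 1 the scaled form
sees the most negative value on the left and reports "less" while the
plain form does not. Such indices will not occur in practice, but the
folder has to assume they might.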

If dependence analysis (which only looks at the access functions when
the bases are compatible) were able to return non-constant distance
vectors, it would return _231 - _225 as the distance, which we could
check at runtime against the vectorization factor.  I suppose that's a
feasible trick to try when code-generating the dependence check.
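
As a rough illustration of such a runtime distance check, here is a
standalone sketch for a unit-stride loop (again not GCC code). VF, the
function names and the "blocked" loop that emulates a vector loop by
doing all loads of a block before its stores are assumptions of mine.
The check is conservative in that it also rejects distance zero and
small negative distances, and it ignores overflow in p - q.

```c
#include <stdint.h>
#include <stdlib.h>

enum { VF = 4 };  /* assumed vectorization factor */

/* Scalar reference: a[i + q] = a[i + p] + 1 for i in [0, n).  */
static void scalar_loop (int64_t *a, int64_t p, int64_t q, int64_t n)
{
  for (int64_t i = 0; i < n; i++)
    a[i + q] = a[i + p] + 1;
}

/* Emulates a vector loop: per block, VF loads happen before VF stores.  */
static void blocked_loop (int64_t *a, int64_t p, int64_t q, int64_t n)
{
  int64_t i = 0;
  for (; i + VF <= n; i += VF)
    {
      int64_t tmp[VF];
      for (int l = 0; l < VF; l++)
        tmp[l] = a[i + p + l] + 1;
      for (int l = 0; l < VF; l++)
        a[i + q + l] = tmp[l];
    }
  for (; i < n; i++)  /* scalar epilogue */
    a[i + q] = a[i + p] + 1;
}

/* Runtime check on the variable distance: |p - q| >= VF.  */
static int distance_check_passes (int64_t p, int64_t q)
{
  return llabs (p - q) >= VF;
}

/* Loop versioning: take the blocked loop only when the check passes.  */
static void versioned_loop (int64_t *a, int64_t p, int64_t q, int64_t n)
{
  if (distance_check_passes (p, q))
    blocked_loop (a, p, q, n);
  else
    scalar_loop (a, p, q, n);
}
```

With p = 0 and q = 1 the check fails and the scalar loop runs, because
the blocked loop would read elements before the previous lane's store
reached them. With q = 5 the check passes and both loops produce the
same result.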

Note that for 416.gamess it looks like NOC is just 5, but MPQ and MRS are
such that there is no runtime aliasing between iterations most of the time
(sometimes they are indeed equal).  The cost model check skips the
vector loop for MK == 2 and 3 and will only execute it for MK == 4 and 5.
An alternative for this kind of loop nest would be to cost-model for
MK % 2 == 0, thus requiring no epilogue loop.

A hack for doing the above is something like the following, which I think
would also work for more than one subscript by combining the tests
with ||.  I think we need to actually test against the vectorization
factor here, and we can ignore negative distances unless ddr_reversed, etc.
Unfortunately compute_affine_dependence frees the subscripts, so we
cannot compute the "variable" distance vector during dependence analysis
and store it away - thus "hack" ;)

diff --git a/gcc/tree-data-ref.c b/gcc/tree-data-ref.c
index 69c5f7b28ae..8973a4557d7 100644
--- a/gcc/tree-data-ref.c
+++ b/gcc/tree-data-ref.c
@@ -1823,6 +1823,30 @@ create_intersect_range_checks (struct loop *loop, tree *cond_expr,
   if (create_intersect_range_checks_index (loop, cond_expr, dr_a, dr_b))
     return;

+  auto_vec<loop_p> loop_nest;
+  bool res = find_loop_nest (loop, &loop_nest);
+  gcc_assert (res);
+  ddr_p ddr = initialize_data_dependence_relation (dr_a.dr, dr_b.dr, loop_nest);
+  if (DDR_SUBSCRIPTS (ddr).length () == 1)
+    {
+      tree fna = SUB_ACCESS_FN (DDR_SUBSCRIPTS (ddr)[0], 0);
+      tree fnb = SUB_ACCESS_FN (DDR_SUBSCRIPTS (ddr)[0], 1);
+      tree diff = chrec_fold_minus (TREE_TYPE (fna), fna, fnb);
+      if (!chrec_contains_undetermined (diff)
+         && !tree_contains_chrecs (diff, NULL))
+       {
+         free_dependence_relation (ddr);
+         if (TYPE_UNSIGNED (TREE_TYPE (diff)))
+           diff = fold_convert (signed_type_for (TREE_TYPE (diff)), diff);
+         *cond_expr = fold_build2 (GE_EXPR, boolean_type_node,
+                                   fold_build1 (ABS_EXPR,
+                                                TREE_TYPE (diff), diff),
+                                   build_int_cst (TREE_TYPE (diff), 2));
+         return;
+       }
+    }
+  free_dependence_relation (ddr);
+
   unsigned HOST_WIDE_INT min_align;
   tree_code cmp_code;
   if (TREE_CODE (DR_STEP (dr_a.dr)) == INTEGER_CST

Benchmarking this change doesn't reveal any difference, though improving
the non-constant stride dependence checks this way might still be
worthwhile.
