https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to rguent...@suse.de from comment #5)
> On Fri, 27 Jan 2017, jgreenhalgh at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245
> >
> > --- Comment #4 from James Greenhalgh <jgreenhalgh at gcc dot gnu.org> ---
> > (In reply to Richard Biener from comment #3)
> > > Note the trivial fix will FAIL gcc.dg/tree-ssa/ldist-23.c which looks like
> > >
> > >   int i;
> > >   for (i = 0; i < 128; ++i)
> > >     {
> > >       a[i] = a[i] + 1;
> > >       b[i] = d[i];
> > >       c[i] = a[i] / d[i];
> > >     }
> > >
> > > where the testcase expects b[i] = d[i] to be split out as memcpy but
> > > the other two partitions to be fused.
> > >
> > > Generally the cost model lacks computing the number of input/output
> > > streams of a partition and a target interface to query it about limits.
> > > Usually store bandwidth is not equal to load bandwidth, and store streams
> > > that are not re-used can benefit from non-temporal stores being used by
> > > libc.
> > >
> > > In your testcase I wonder whether distributing to
> > >
> > >   for (int j = 0; j < x; j++)
> > >     {
> > >       for (int i = 0; i < y; i++)
> > >         {
> > >           c[j][i] = b[j][i] - a[j][i];
> > >         }
> > >     }
> > >   memcpy (a, b, ...);
> > >
> > > would be faster in the end (or even doing the memcpy first in this case).
> > >
> > > Well, for now let's be more conservative given the cost model really is
> > > lacking.
> >
> > The testcase is reduced from CALC3 in 171.swim. I've been seeing a 3%
> > regression for Cortex-A72 after r242038, and I can fix that with
> > -fno-tree-loop-distribute-patterns.
> >
> > In that benchmark you've got 3 instances of the above pattern, so you end
> > up with 3 memcpy calls after:
> >
> >       DO 300 J=1,N
> >       DO 300 I=1,M
> >       UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
> >       VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
> >       POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
> >       U(I,J) = UNEW(I,J)
> >       V(I,J) = VNEW(I,J)
> >       P(I,J) = PNEW(I,J)
> >   300 CONTINUE
> >
> > 3 memcpy calls compared to 3 vector store instructions doesn't seem the
> > right tradeoff to me. Sorry if I reduced the testcase too far to make the
> > balance clear.
>
> Itanic seems to like it though:
>   http://gcc.opensuse.org/SPEC/CFP/sb-terbium-head-64/171_swim_big.png
>
> For Haswell I don't see any ups/downs; for AMD Fam15 I see a slowdown as
> well around that time. I guess it depends on whether the CPU is already
> throttled by load/compute bandwidth here. What should be profitable is to
> distribute the above into three loops (w/o memcpy), i.e. after the patch,
> using -ftree-loop-distribution. The patch being:
>
> Index: gcc/tree-loop-distribution.c
> ===================================================================
> --- gcc/tree-loop-distribution.c        (revision 244963)
> +++ gcc/tree-loop-distribution.c        (working copy)
> @@ -1548,8 +1548,7 @@ distribute_loop (struct loop *loop, vec<
>        for (int j = i + 1;
>             partitions.iterate (j, &partition); ++j)
>          {
> -          if (!partition_builtin_p (partition)
> -              && similar_memory_accesses (rdg, into, partition))
> +          if (similar_memory_accesses (rdg, into, partition))
>              {
>                if (dump_file && (dump_flags & TDF_DETAILS))
>                  {

The distribution into three loops doesn't work because (Fortran, eh) all
arrays are part of the same (unnamed) COMMON block and thus the memory
accesses appear related: in GIMPLE we see __BLNK__.uold[] and __BLNK__.unew[],
so they appear to be part of the same structure.
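To make that last point concrete, here is a minimal sketch (my own C
illustration, not code from GCC or the PR; the member names mirror the GIMPLE
dump, the element type and sizes are invented) of why the heuristic treats the
COMMON-block arrays as related: gfortran lowers the unnamed COMMON block to a
single global aggregate, so every array reference shares the same base address
even though the individual arrays never overlap.

  #define NN 512

  /* Hypothetical C equivalent of the unnamed COMMON block.  */
  static struct
  {
    float uold[NN][NN];
    float unew[NN][NN];
    /* ... remaining swim arrays ... */
  } __BLNK__;

  void
  copy_u (int n, int m)
  {
    for (int j = 0; j < n; j++)
      for (int i = 0; i < m; i++)
        /* Both references have base address &__BLNK__, so the cost model
           sees them as accesses to the same structure even though uold
           and unew are disjoint.  */
        __BLNK__.uold[j][i] = __BLNK__.unew[j][i];
  }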
The cost model is really too simple (it neither considers dependence distance
nor considers non-dependent accesses to be of infinite distance and thus
unrelated; of course we compute dependences only after applying the cost model
at the moment).

It works with some extra (partly redundant) dependence checking:

Index: gcc/tree-loop-distribution.c
===================================================================
--- gcc/tree-loop-distribution.c        (revision 244963)
+++ gcc/tree-loop-distribution.c        (working copy)
@@ -1199,8 +1199,8 @@ ref_base_address (data_reference_p dr)
    accesses in RDG.  */
 
 static bool
-similar_memory_accesses (struct graph *rdg, partition *partition1,
-                         partition *partition2)
+similar_memory_accesses (struct graph *rdg, vec<loop_p> loops,
+                         partition *partition1, partition *partition2)
 {
   unsigned i, j, k, l;
   bitmap_iterator bi, bj;
@@ -1228,7 +1228,18 @@ similar_memory_accesses (struct graph *r
          if (base1)
            FOR_EACH_VEC_ELT (RDG_DATAREFS (rdg, j), l, ref2)
              if (base1 == ref_base_address (ref2))
-               return true;
+               {
+                 ddr_p ddr = initialize_data_dependence_relation
+                               (ref1, ref2, loops);
+                 compute_affine_dependence (ddr, loops[0]);
+                 tree res = DDR_ARE_DEPENDENT (ddr);
+                 free_dependence_relation (ddr);
+                 /* If the refs are independent then they constitute
+                    different memory streams.
+                    ???  Handle known large enough distance.  */
+                 if (res != chrec_known)
+                   return true;
+               }
        }
     }
 
@@ -1548,8 +1559,7 @@ distribute_loop (struct loop *loop, vec<
       for (int j = i + 1;
            partitions.iterate (j, &partition); ++j)
         {
-          if (!partition_builtin_p (partition)
-              && similar_memory_accesses (rdg, into, partition))
+          if (similar_memory_accesses (rdg, loop_nest, into, partition))
            {
              if (dump_file && (dump_flags & TDF_DETAILS))
                {

I'm curious how that benchmarks for you (with -ftree-loop-distribution vs.
without).
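For reference, a minimal C sketch of the three-way distribution discussed
above (my own illustration; the array names, bounds, and single-precision
types are assumptions, not taken from 171.swim): each array group gets its own
loop and the copies stay as ordinary, vectorizable stores rather than being
split out into three memcpy calls. This is roughly the shape one would hope to
compare against the fused loop when benchmarking with vs. without
-ftree-loop-distribution.

  #define NN 512
  static float u[NN][NN], v[NN][NN], p[NN][NN];
  static float unew[NN][NN], vnew[NN][NN], pnew[NN][NN];
  static float uold[NN][NN], vold[NN][NN], pold[NN][NN];
  static float alpha;

  /* Fused form, roughly what the CALC3 loop nest looks like in C.  */
  void
  calc3_fused (int n, int m)
  {
    for (int j = 0; j < n; j++)
      for (int i = 0; i < m; i++)
        {
          uold[j][i] = u[j][i] + alpha * (unew[j][i] - 2.f * u[j][i] + uold[j][i]);
          vold[j][i] = v[j][i] + alpha * (vnew[j][i] - 2.f * v[j][i] + vold[j][i]);
          pold[j][i] = p[j][i] + alpha * (pnew[j][i] - 2.f * p[j][i] + pold[j][i]);
          u[j][i] = unew[j][i];
          v[j][i] = vnew[j][i];
          p[j][i] = pnew[j][i];
        }
  }

  /* Hand-distributed form: one loop per independent array group, copies
     kept in-loop instead of being turned into memcpy.  */
  void
  calc3_distributed (int n, int m)
  {
    for (int j = 0; j < n; j++)
      for (int i = 0; i < m; i++)
        {
          uold[j][i] = u[j][i] + alpha * (unew[j][i] - 2.f * u[j][i] + uold[j][i]);
          u[j][i] = unew[j][i];
        }
    for (int j = 0; j < n; j++)
      for (int i = 0; i < m; i++)
        {
          vold[j][i] = v[j][i] + alpha * (vnew[j][i] - 2.f * v[j][i] + vold[j][i]);
          v[j][i] = vnew[j][i];
        }
    for (int j = 0; j < n; j++)
      for (int i = 0; i < m; i++)
        {
          pold[j][i] = p[j][i] + alpha * (pnew[j][i] - 2.f * p[j][i] + pold[j][i]);
          p[j][i] = pnew[j][i];
        }
  }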