On Tue, 20 Mar 2018, Bin.Cheng wrote:

> On Mon, Mar 19, 2018 at 11:57 AM, Jakub Jelinek <ja...@redhat.com> wrote:
> > On Fri, Mar 16, 2018 at 03:08:17PM +0100, Richard Biener wrote:
> >> Bootstrapped and tested on x86_64-unknown-linux-gnu, ok?
> >
> > Ok, thanks.
> Hi Richard,
> 
> This change causes below test failure on aarch64/arm linux:
> FAIL: libgomp.graphite/force-parallel-4.c scan-tree-dump-times
> graphite "2 loops carried no dependency" 1

So it looks like the conditional stores to A[i] in

  for (i = 0; i < N; i++)
    {
      if (i < T)
        /* loop i1: carried no dependency.  */
        A[i] = A[i+T];
      else
        /* loop i2: carried dependency.  */
        A[i] = A[i+T+1];
    }

vs. the stores being commoned into the merge block cause the merge
block to appear in the schedule, and thus

 [scheduler] original ast:
 {
   for (int c0 = 0; c0 <= 19999; c0 += 1)
     S_3(c0);
-  for (int c0 = 0; c0 <= 999; c0 += 1)
-    S_5(c0);
-  for (int c0 = 1000; c0 <= 9999; c0 += 1)
-    S_6(c0);
+  for (int c0 = 0; c0 <= 9999; c0 += 1) {
+    if (c0 >= 1000) {
+      S_6(c0);
+    } else {
+      S_5(c0);
+    }
+    S_7(c0);
+  }
 }

 [scheduler] AST generated by isl:
 {
   for (int c0 = 0; c0 <= 19999; c0 += 1)
     S_3(c0);
-  for (int c0 = 0; c0 <= 999; c0 += 1)
+  for (int c0 = 0; c0 <= 999; c0 += 1) {
     S_5(c0);
-  for (int c0 = 1000; c0 <= 9999; c0 += 1)
+    S_7(c0);
+  }
+  for (int c0 = 1000; c0 <= 9999; c0 += 1) {
     S_6(c0);
+    S_7(c0);
+  }
 }

I'm not sure why neither of the two loops ends up parallelizable then.
Possibly because of the dependences we generate for the cross-BB
scalar deps, which store to and load from a "scalar" memory location
(in the end those are just SSA dependences).  So it's an artificial
limitation caused by how we represent scheduling dependences,
I guess:

+Adding scalar write: _4
+From stmt: _4 = A[_3];
+Adding scalar write: cstore_13
+From stmt: cstore_13 = PHI <_2(5), _4(6)>
...
+Adding scalar write: _2
+From stmt: _2 = A[_1];
+Adding scalar write: cstore_13
+From stmt: cstore_13 = PHI <_2(5), _4(6)>
...
+Adding scalar read: cstore_13
+From stmt: cstore_13 = PHI <_2(5), _4(6)>

I have no idea or plan to improve that on the GRAPHITE/ISL side right
now, but maybe Sebastian has one?  We'd need to tell ISL that we can
"elide" the dependence, or rather that it is intra-iteration only and
doesn't leak cross-iteration - kind of clobbering it at the end?

So I'm going to XFAIL the testcase.

Thanks,
Richard.
