Just my two cents.

Let's consider a simple example:

for (int I = 0; I < 2; I++) {
  A = Compute();
  barrier();
  Use(A);
}

This does not mean that Compute() must be executed for all loop iterations 
before barrier() is executed. It only means that, within each loop iteration, 
Compute() must be executed by all threads before any thread passes barrier(). 
Otherwise, we could not use barrier() in a loop at all.

Semantically, the program above is no different from the following one:

  A = Compute();
  barrier();
  Use(A);
  A = Compute();
  barrier();
  Use(A);

Even if Compute() and Use() have side effects, the two programs are still 
equivalent.
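
To make this concrete, here is a minimal single-threaded sketch in plain C, 
not OpenCL. It models a work-group by running each phase for every work-item 
in turn, so the point marked barrier() is reached only after all work-items 
have finished Compute() for that iteration. The names ITEMS, ITERS, compute(), 
use(), iteration(), run_rolled(), run_unrolled() and check_forms_agree() are 
all made up for illustration, and the values are arbitrary. The buffer is 
indexed by iteration so that one barrier per iteration is enough in this model.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical work-group size and loop trip count. */
enum { ITEMS = 4, ITERS = 2 };

/* Hypothetical stand-in for Compute(): one value per work-item and iteration. */
static int compute(int item, int iter) { return 10 * iter + item; }

/* Hypothetical stand-in for Use(): reads a neighbour's value, so it is only
   correct once every work-item has finished compute() for this iteration. */
static int use(const int a[ITEMS], int item) {
    return a[item] + a[(item + 1) % ITEMS];
}

/* One loop body for the whole work-group. */
static void iteration(int iter, int a[ITERS][ITEMS], int sum[ITEMS]) {
    for (int item = 0; item < ITEMS; item++)
        a[iter][item] = compute(item, iter);
    /* barrier(): all values of a[iter] are complete at this point. */
    for (int item = 0; item < ITEMS; item++)
        sum[item] += use(a[iter], item);
}

/* Rolled form: the loop from the example. */
static void run_rolled(int sum[ITEMS]) {
    int a[ITERS][ITEMS];
    for (int i = 0; i < ITERS; i++)
        iteration(i, a, sum);
}

/* Unrolled form: two copies of the body, hence two barrier() call sites. */
static void run_unrolled(int sum[ITEMS]) {
    int a[ITERS][ITEMS];
    iteration(0, a, sum);
    iteration(1, a, sum);
}

/* Asserts that both forms leave identical results for every work-item. */
static void check_forms_agree(void) {
    int rolled[ITEMS] = {0};
    int unrolled[ITEMS] = {0};
    run_rolled(rolled);
    run_unrolled(unrolled);
    assert(memcmp(rolled, unrolled, sizeof rolled) == 0);
}
```

Calling check_forms_agree() from a main() is enough to exercise both forms. 
Note that Compute() for iteration 1 runs after Use() for iteration 0 in both 
forms; the barrier only orders the phases within a single iteration.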

Sam

-----Original Message-----
From: Anastasia Stulova [mailto:anastasia.stul...@arm.com] 
Sent: Monday, October 24, 2016 1:08 PM
To: Liu, Yaxun (Sam) <yaxun....@amd.com>; alexey.ba...@intel.com; 
anastasia.stul...@arm.com; aaron.ball...@gmail.com
Cc: Stellard, Thomas <tom.stell...@amd.com>; Arsenault, Matthew 
<matthew.arsena...@amd.com>; Sumner, Brian <brian.sum...@amd.com>; 
cfe-commits@lists.llvm.org
Subject: [PATCH] D25343: [OpenCL] Mark group functions as convergent in 
opencl-c.h

Anastasia added a comment.

In https://reviews.llvm.org/D25343#567374, @tstellarAMD wrote:

> In https://reviews.llvm.org/D25343#565288, @Anastasia wrote:
>
> > Do you have any code example where Clang/LLVM performs wrong optimizations 
> > with respect to the control flow of SPMD execution?
> >
> > My understanding from our earlier discussion 
> > (https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg22643.html) is 
> > that noduplicate is essentially enough for the frontend to prevent 
> > erroneous optimizations, because in general the compiler can't do much 
> > with unknown function calls.
>
>
> noduplicate is enough for correctness, but it prevents legal optimizations, 
> like unrolling loops with barriers.  The convergent attribute was added 
> specifically for these kinds of builtins, so we should be using it here 
> instead of noduplicate.
>
> > For LLVM intrinsics it is slightly different, as I can deduce from this 
> > discussion: http://lists.llvm.org/pipermail/llvm-dev/2015-May/085558.html. 
> > It seems that by default an intrinsic is assumed to be side-effect free 
> > and can be optimized in various ways.




Tom, as far as I understand, a valid generalized use case with a barrier is as 
follows:

  a = compute() 
  barrier()
  use(a)

Regarding unrolling, I don't see how a loop containing the barrier can be 
unrolled without breaking the correctness of OpenCL programs, because all 
compute() calls of one work-group (WG) have to complete before use(a) is 
invoked for the first time. I am not an expert in LLVM transformations, but as 
I read the descriptions of noduplicate and convergent 
(http://llvm.org/docs/LangRef.html), neither seems to make sense to me for the 
barrier, because both allow reordering within the same function or basic block 
(BB), respectively. The only valid way seems to be to mark it as a side-effect 
intrinsic, to prevent any reordering of the barrier itself. I understand that 
within compute() or use() independently there are no limitations on the 
ordering of instructions.

If you have a specific example in mind where unrolling would work, could you 
please elaborate on it? I think it's essential for the community to understand 
the use cases for noduplicate/convergent/sideeffect, so that we can make 
better use of them across similar models, e.g. OpenCL, CUDA, etc.


https://reviews.llvm.org/D25343



_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
