On 03/19/2018 07:04 AM, Tom de Vries wrote:
> On 03/09/2018 05:55 PM, Cesar Philippidis wrote:
>> On 03/09/2018 08:21 AM, Tom de Vries wrote:
>>> On 03/09/2018 12:31 AM, Cesar Philippidis wrote:
>>>> Nvidia Volta GPUs now support warp-level synchronization.
>>>
>>> Well, let's try to make that statement a bit more precise.
>>>
>>> All Nvidia architectures have supported synchronization of threads in
>>> a warp on a very basic level: by means of convergence (and
>>> unfortunately, we've seen that this is very error-prone).
>>>
>>> What is new in ptx 6.0 combined with sm_70 is the ability to sync
>>> divergent threads without having to converge, f.i. by using the new
>>> instructions bar.warp.sync and barrier.sync.
>>
>> Yes. The major difference between sm_70 GPU architectures and earlier
>> GPUs is that sm_70 allows the user to explicitly synchronize divergent
>> warps. At least on Maxwell and Pascal, the PTX-to-SASS compiler uses
>> two instructions to branch, SYNC and BRA. I think SYNC guarantees that
>> a warp is convergent at the SYNC point, whereas BRA makes no such
>> guarantees.
>>
>
> If you want to understand the interplay of sync (or the .s suffix),
> branch and ssy, please read
> https://people.engr.ncsu.edu/hzhou/ispass_15-poster.pdf .

Interesting, thanks!
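
Just so we have something concrete to point at: below is the kind of
CUDA-level idiom I have in mind when I talk about explicitly
synchronizing a warp. It's only an illustrative sketch, not code from
the patch; the kernel, the names, and the one-warp-per-block assumption
are mine. __syncwarp() is the CUDA 9 intrinsic which, as far as I can
tell, is what ends up as bar.warp.sync:

  /* Warp-level shared-memory reduction.  Pre-CUDA-9 code typically
     relied on implicit warp-synchronous execution here; on sm_70 the
     synchronization has to be (and now can be) made explicit.
     Assumes a single full warp per block, for simplicity.  */
  __global__ void
  warp_sum (const int *in, int *out)
  {
    __shared__ int s[32];
    unsigned lane = threadIdx.x % 32;

    s[lane] = in[threadIdx.x];
    __syncwarp ();

    for (int offset = 16; offset > 0; offset >>= 1)
      {
        int other = s[lane ^ offset];  /* read the partner's partial sum */
        __syncwarp ();                 /* all reads before any write     */
        s[lane] += other;
        __syncwarp ();                 /* all writes before the next read */
      }

    if (lane == 0)
      out[blockIdx.x] = s[0];
  }
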
>> What's worse, once a warp has become divergent on sm_60 and earlier
>> GPUs, there's no way to reliably reconverge it. So, to avoid that
>> problem, it is critical that the PTX-to-SASS compiler use SYNC
>> instructions when possible. Fortunately, bar.warp.sync resolves the
>> divergent warp problem on sm_70+.
>>
>>>> As such, the semantics of legacy bar.sync instructions have slightly
>>>> changed on newer GPUs.
>>>
>>> Before, in ptx 3.1, we have for bar.sync:
>>> ...
>>> Barriers are executed on a per-warp basis as if all the threads in a
>>> warp are active. Thus, if any thread in a warp executes a bar
>>> instruction, it is as if all the threads in the warp have executed
>>> the bar instruction. All threads in the warp are stalled until the
>>> barrier completes, and the arrival count for the barrier is
>>> incremented by the warp size (not the number of active threads in the
>>> warp). In conditionally executed code, a bar instruction should only
>>> be used if it is known that all threads evaluate the condition
>>> identically (the warp does not diverge).
>>> ...
>>>
>>> But in ptx 6.0, we have:
>>> ...
>>> bar.sync is equivalent to barrier.sync.aligned
>>> ...
>>> and:
>>> ...
>>> Instruction barrier has optional .aligned modifier. When specified, it
>>> indicates that all threads in CTA will execute the same barrier
>>> instruction. In conditionally executed code, an aligned barrier
>>> instruction should only be used if it is known that all threads in
>>> CTA evaluate the condition identically, otherwise behavior is
>>> undefined.
>>> ...
>>>
>>> So, in ptx 3.1 bar.sync should be executed in convergent mode (all the
>>> threads in each warp executing the same). But in ptx 6.0, bar.sync
>>> should be executed in the mode that the whole CTA is executing the
>>> same code.
>>>
>>> So going from the description of ptx, it seems indeed that the
>>> semantics of bar.sync has changed. That is however surprising, since
>>> it would break the forward compatibility that AFAIU is the idea
>>> behind ptx.
>>>
>>> So for now my hope is that this is a documentation error.
>>
>> I spent a lot of time debugging deadlocks with the vector length
>> changes, and I have seen no changes in the SASS code generated by the
>> newer Nvidia drivers when compared to the older ones, at least with
>> respect to the barrier instructions. This isn't the first time I've
>> seen inconsistencies with thread synchronization in Nvidia's
>> documentation. For the longest time, the "CUDA Programming Guide"
>> provided slightly conflicting semantics for the __syncthreads()
>> function, which ultimately gets implemented as bar.sync in PTX.
>>
>>>> The PTX JIT will now, occasionally, emit a warpsync instruction
>>>> immediately before a bar.sync for Volta GPUs. That implies that warps
>>>> must be convergent on entry to those thread barriers.
>>>
>>> That warps must be convergent on entry to bar.sync is already required
>>> by ptx 3.1.
>>>
>>> [ And bar.warp.sync does not force convergence, so if the warpsync
>>> instruction you mention is equivalent to bar.warp.sync, then your
>>> reasoning is incorrect. ]
>>
>> I'm under the impression that bar.warp.sync converges all of the
>> non-exited threads in a warp.
>
> I have not played around with the instruction yet, so I'm not sure, but
> what I read from the docs is that bar.warp.sync converges all of the
> non-exited threads in a warp if and only if it's positioned at a point
> post-dominating a divergent branch.
>
> Consider this case:
> ...
> if (tid.x == 0)
>   {
>     A;
>     bar.warp.sync 32;
>     B;
>   }
> else
>   {
>     C;
>     bar.warp.sync 32;
>     D;
>   }
> ...
> AFAIU, this allows bar.warp.sync to synchronize the threads in the
> warp, _without_ converging.

I think that's partially wrong. Check out the literature for CUDA 9
cooperative groups, such as
<https://devblogs.nvidia.com/cooperative-groups/>, to get an idea of the
intent behind bar.warp.sync.
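
For instance, with cooperative groups the divergent-branch case above
would look roughly like the sketch below. This is untested and purely
illustrative (the kernel and array names are mine); coalesced_threads()
hands you a group made up of the lanes that are active in the current
branch, and my understanding is that its sync() is what gets emitted as
bar.warp.sync with the corresponding member mask:

  #include <cooperative_groups.h>
  namespace cg = cooperative_groups;

  __global__ void
  sketch (int *a, int *b)
  {
    if (threadIdx.x % 2 == 0)
      {
        /* Only the even lanes are active here, so g covers just those
           lanes and g.sync () waits only for them, not for the odd
           lanes sitting in the else branch.  */
        cg::coalesced_group g = cg::coalesced_threads ();
        a[threadIdx.x] = g.thread_rank ();
        g.sync ();
        b[threadIdx.x] = a[threadIdx.x ^ 2];  /* written by another even lane */
      }
    else
      {
        cg::coalesced_group g = cg::coalesced_threads ();
        a[threadIdx.x] = g.thread_rank ();
        g.sync ();
        b[threadIdx.x] = a[threadIdx.x ^ 2];  /* written by another odd lane */
      }
  }

The point being that the group (i.e. the mask) spells out which threads
are expected to arrive at the barrier, instead of assuming that the
whole warp has converged there.
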
>> You'd still need to use bar.sync or some variant of the new barrier
>> instruction to converge the entire CTA. But at the moment, we're still
>> generating code that's backwards compatible with sm_30.
>>
>>>> The problem in og7, and trunk, is that GCC emits barrier
>>>> instructions at the wrong spots. E.g., consider the following
>>>> OpenACC parallel region:
>>>>
>>>>   #pragma acc parallel loop worker
>>>>   for (i = 0; i < 10; i++)
>>>>     a[i] = i;
>>>>
>>>> At -O2, GCC generates the following PTX code:
>>>>
>>>>   {
>>>>     .reg.u32 %y;
>>>>     mov.u32 %y, %tid.y;
>>>>     setp.ne.u32 %r76, %y, 0;
>>>>   }
>>>>   {
>>>>     .reg.u32 %x;
>>>>     mov.u32 %x, %tid.x;
>>>>     setp.ne.u32 %r75, %x, 0;
>>>>   }
>>>>   @%r76 bra.uni $L6;
>>>>   @%r75 bra $L7;
>>>>   mov.u64 %r67, %ar0;
>>>>   // fork 2;
>>>>   cvta.shared.u64 %r74, __oacc_bcast;
>>>>   st.u64 [%r74], %r67;
>>>> $L7:
>>>> $L6:
>>>>   @%r75 bra $L5;
>>>>   // forked 2;
>>>>   bar.sync 0;
>>>>   cvta.shared.u64 %r73, __oacc_bcast;
>>>>   ld.u64 %r67, [%r73];
>>>>   mov.u32 %r62, %ntid.y;
>>>>   mov.u32 %r63, %tid.y;
>>>>   setp.gt.s32 %r68, %r63, 9;
>>>>   @%r68 bra $L2;
>>>>   mov.u32 %r55, %r63;
>>>>   cvt.s64.s32 %r69, %r62;
>>>>   shl.b64 %r59, %r69, 2;
>>>>   cvt.s64.s32 %r70, %r55;
>>>>   shl.b64 %r71, %r70, 2;
>>>>   add.u64 %r58, %r67, %r71;
>>>> $L3:
>>>>   st.u32 [%r58], %r55;
>>>>   add.u32 %r55, %r55, %r62;
>>>>   add.u64 %r58, %r58, %r59;
>>>>   setp.le.s32 %r72, %r55, 9;
>>>>   @%r72 bra $L3;
>>>> $L2:
>>>>   bar.sync 1;
>>>>   // joining 2;
>>>> $L5:
>>>>   // join 2;
>>>>   ret;
>>>>
>>>> Note the bar.sync instructions placed immediately after the forked
>>>> comment and before the joining comment. The problem here is that the
>>>> branch above the forked comment guarantees that the warps are not
>>>> synchronous (when vector_length > 1, which is always the case).
>>>
>>> This is already advised against in ptx 3.1, so yes, we should fix
>>> this.
>>>
>>>> Likewise, the bar.sync instruction before the joining comment should
>>>> be placed after label $L5 in order to allow all of the threads in
>>>> the warp to reach it.
>>>
>>> Agreed.
>>>
>>>> The attached patch teaches the nvptx backend to make those
>>>> adjustments.
>>>
>>> Can you show me a diff of the ptx for the test-case above for trunk?
>>
>> --- w-old.s 2018-03-08 15:19:47.139516578 -0800
>> +++ w.s 2018-03-09 08:42:52.217057332 -0800
>> @@ -46,9 +46,9 @@
>>   st.u64 [%r74], %r67;
>>  $L7:
>>  $L6:
>> - @%r75 bra $L5;
>>   // forked 2;
>>   bar.sync 0;
>> + @%r75 bra $L5;
>>   cvta.shared.u64 %r73, __oacc_bcast;
>>   ld.u64 %r67, [%r73];
>>   mov.u32 %r62, %ntid.y;
>> @@ -68,9 +68,9 @@
>>   setp.le.s32 %r72, %r55, 9;
>>   @%r72 bra $L3;
>>  $L2:
>> - bar.sync 1;
>>   // joining 2;
>>  $L5:
>> + bar.sync 1;
>>   // join 2;
>>   ret;
>>  }
>>
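
To spell out the invariant I'm trying to establish with that change
(just my mental model, written as a CUDA-ish sketch rather than what
the backend literally emits, with __syncthreads() standing in for
bar.sync):

  /* Hypothetical shape of a worker-partitioned region: the CTA-level
     barriers must sit outside the vector-neutered branch so that every
     thread reaches them; only the work in between is skipped by the
     neutered threads.  */
  __device__ void
  worker_region (int *a, int i)
  {
    __syncthreads ();       /* bar.sync 0: reached by all threads.     */

    if (threadIdx.x == 0)   /* vector-neutering: only lane 0 ...       */
      a[i] = i;             /* ... executes the loop body.             */

    __syncthreads ();       /* bar.sync 1: again reached by everyone,  */
                            /* i.e. placed after the join label.       */
  }
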
> At -O0, yes.
>
> At -O2, we have:
> ...
> diff -u -a 1 2
> --- 1 2018-03-19 14:13:44.074834552 +0100
> +++ 2 2018-03-19 14:15:06.075301168 +0100
> @@ -42,20 +42,20 @@
>   st.u64 [%r32],%r25;
>  $L7:
>  $L6:
> -@ %r33 bra $L5;
>   // forked 2;
>   bar.sync 0;
> +@ %r33 bra $L5;
>   cvta.shared.u64 %r31,__worker_bcast;
>   ld.u64 %r25,[%r31];
>   mov.u32 %r24,%tid.y;
>   setp.le.s32 %r26,%r24,9;
>   @ %r26 bra $L2;
>  $L3:
> -bar.sync 1;
>   // joining 2;
>  $L5:
> -@ %r34 bra.uni $L8;
> +bar.sync 1;
>   @ %r33 bra $L9;
> +@ %r34 bra.uni $L8;
>   // join 2;
>  $L9:
>  $L8:
> ...
>
> Note that this changes the ordering of the vector-neutering jump and
> the worker-neutering jump at the end. In principle, this should not be
> harmful, but it violates the invariant that vector-neutering
> branch-around code should be as short-lived as possible. So, this needs
> to be fixed.
>
> I've found this issue by adding verification of the neutering, as
> attached below.

ACK, thanks. I'll take a closer look at this. Is your patch purely for
debugging, or are you planning on committing it to og7 and trunk?

Cesar