On 03/19/2018 07:04 AM, Tom de Vries wrote:
> On 03/09/2018 05:55 PM, Cesar Philippidis wrote:
>> On 03/09/2018 08:21 AM, Tom de Vries wrote:
>>> On 03/09/2018 12:31 AM, Cesar Philippidis wrote:
>>>> Nvidia Volta GPUs now support warp-level synchronization.
>>>
>>> Well, let's try to make that statement a bit more precise.
>>>
>>> All Nvidia architectures have supported synchronization of threads in
>>> a warp on a very basic level: by means of convergence (and
>>> unfortunately, we've seen that this is very error-prone).
>>>
>>> What is new in ptx 6.0 combined with sm_70 is the ability to sync
>>> divergent threads without having to converge, f.i. by using the new
>>> instructions bar.warp.sync and barrier.sync.
>>
>> Yes. The major difference between sm_70 GPU architectures and earlier
>> GPUs is that sm_70 allows the user to explicitly synchronize divergent
>> warps. At least on Maxwell and Pascal, the PTX-to-SASS compiler uses
>> two instructions to branch, SYNC and BRA. I think SYNC guarantees that
>> a warp is convergent at the SYNC point, whereas BRA makes no such
>> guarantees.
>>
>
> If you want to understand the interplay of sync (or the .s suffix),
> branch and ssy, please read
> https://people.engr.ncsu.edu/hzhou/ispass_15-poster.pdf .

Interesting, thanks!
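
Just so we have something concrete to point at: below is the kind of
CUDA-level idiom I have in mind when I talk about explicitly
synchronizing a warp. It's only an illustrative sketch, not code from
the patch; the kernel, the names, and the one-warp-per-block assumption
are mine. __syncwarp() is the CUDA 9 intrinsic which, as far as I can
tell, is what ends up as bar.warp.sync:

  /* Warp-level shared-memory reduction.  Pre-CUDA-9 code typically
     relied on implicit warp-synchronous execution here; on sm_70 the
     synchronization has to be (and now can be) made explicit.
     Assumes a single full warp per block, for simplicity.  */
  __global__ void
  warp_sum (const int *in, int *out)
  {
    __shared__ int s[32];
    unsigned lane = threadIdx.x % 32;

    s[lane] = in[threadIdx.x];
    __syncwarp ();

    for (int offset = 16; offset > 0; offset >>= 1)
      {
        int other = s[lane ^ offset];  /* read the partner's partial sum */
        __syncwarp ();                 /* all reads before any write     */
        s[lane] += other;
        __syncwarp ();                 /* all writes before the next read */
      }

    if (lane == 0)
      out[blockIdx.x] = s[0];
  }
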
>> What's worse, once a warp has become divergent on sm_60 and earlier
>> GPUs, there's no way to reliably reconverge it. So, to avoid that
>> problem, it is critical that the PTX-to-SASS compiler use SYNC
>> instructions when possible. Fortunately, bar.warp.sync resolves the
>> divergent warp problem on sm_70+.
>>
>>>> As such, the semantics of legacy bar.sync instructions have slightly
>>>> changed on newer GPUs.
>>>
>>> Before, in ptx 3.1, we have for bar.sync:
>>> ...
>>> Barriers are executed on a per-warp basis as if all the threads in a
>>> warp are active. Thus, if any thread in a warp executes a bar
>>> instruction, it is as if all the threads in the warp have executed
>>> the bar instruction. All threads in the warp are stalled until the
>>> barrier completes, and the arrival count for the barrier is
>>> incremented by the warp size (not the number of active threads in the
>>> warp). In conditionally executed code, a bar instruction should only
>>> be used if it is known that all threads evaluate the condition
>>> identically (the warp does not diverge).
>>> ...
>>>
>>> But in ptx 6.0, we have:
>>> ...
>>> bar.sync is equivalent to barrier.sync.aligned
>>> ...
>>> and:
>>> ...
>>> Instruction barrier has optional .aligned modifier. When specified, it
>>> indicates that all threads in CTA will execute the same barrier
>>> instruction. In conditionally executed code, an aligned barrier
>>> instruction should only be used if it is known that all threads in
>>> CTA evaluate the condition identically, otherwise behavior is
>>> undefined.
>>> ...
>>>
>>> So, in ptx 3.1 bar.sync should be executed in convergent mode (all the
>>> threads in each warp executing the same). But in ptx 6.0, bar.sync
>>> should be executed in the mode that the whole CTA is executing the
>>> same code.
>>>
>>> So going from the description of ptx, it seems indeed that the
>>> semantics of bar.sync has changed. That is however surprising, since
>>> it would break the forward compatibility that AFAIU is the idea
>>> behind ptx.
>>>
>>> So for now my hope is that this is a documentation error.
>>
>> I spent a lot of time debugging deadlocks with the vector length
>> changes, and I have seen no changes in the SASS code generated by the
>> newer Nvidia drivers when compared to the older ones, at least with
>> respect to the barrier instructions. This isn't the first time I've
>> seen inconsistencies with thread synchronization in Nvidia's
>> documentation. For the longest time, the "CUDA Programming Guide"
>> provided slightly conflicting semantics for the __syncthreads()
>> function, which ultimately gets implemented as bar.sync in PTX.
>>
>>>> The PTX JIT will now, occasionally, emit a warpsync instruction
>>>> immediately before a bar.sync for Volta GPUs. That implies that warps
>>>> must be convergent on entry to those thread barriers.
>>>
>>> That warps must be convergent on entry to bar.sync is already required
>>> by ptx 3.1.
>>>
>>> [ And bar.warp.sync does not force convergence, so if the warpsync
>>> instruction you mention is equivalent to bar.warp.sync, then your
>>> reasoning is incorrect. ]
>>
>> I'm under the impression that bar.warp.sync converges all of the
>> non-exited threads in a warp.
>
> I have not played around with the instruction yet, so I'm not sure, but
> what I read from the docs is that bar.warp.sync converges all of the
> non-exited threads in a warp if and only if it's positioned at a point
> post-dominating a divergent branch.
>
> Consider this case:
> ...
> if (tid.x == 0)
>   {
>     A;
>     bar.warp.sync 32;
>     B;
>   }
> else
>   {
>     C;
>     bar.warp.sync 32;
>     D;
>   }
> ...
> AFAIU, this allows bar.warp.sync to synchronize the threads in the
> warp, _without_ converging.

I think that's partially wrong. Check out the literature for CUDA 9
cooperative groups, such as
<https://devblogs.nvidia.com/cooperative-groups/>, to get an idea of the
intent behind bar.warp.sync.
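
For instance, with cooperative groups the divergent-branch case above
would look roughly like the sketch below. This is untested and purely
illustrative (the kernel and array names are mine); coalesced_threads()
hands you a group made up of the lanes that are active in the current
branch, and my understanding is that its sync() is what gets emitted as
bar.warp.sync with the corresponding member mask:

  #include <cooperative_groups.h>
  namespace cg = cooperative_groups;

  __global__ void
  sketch (int *a, int *b)
  {
    if (threadIdx.x % 2 == 0)
      {
        /* Only the even lanes are active here, so g covers just those
           lanes and g.sync () waits only for them, not for the odd
           lanes sitting in the else branch.  */
        cg::coalesced_group g = cg::coalesced_threads ();
        a[threadIdx.x] = g.thread_rank ();
        g.sync ();
        b[threadIdx.x] = a[threadIdx.x ^ 2];  /* written by another even lane */
      }
    else
      {
        cg::coalesced_group g = cg::coalesced_threads ();
        a[threadIdx.x] = g.thread_rank ();
        g.sync ();
        b[threadIdx.x] = a[threadIdx.x ^ 2];  /* written by another odd lane */
      }
  }

The point being that the group (i.e. the mask) spells out which threads
are expected to arrive at the barrier, instead of assuming that the
whole warp has converged there.
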
>> You'd still need to use bar.sync or some variant of the new barrier
>> instruction to converge the entire CTA. But at the moment, we're still
>> generating code that's backwards compatible with sm_30.
>>
>>>> The problem in og7, and trunk, is that GCC emits barrier
>>>> instructions at the wrong spots. E.g., consider the following
>>>> OpenACC parallel region:
>>>>
>>>>   #pragma acc parallel loop worker
>>>>   for (i = 0; i < 10; i++)
>>>>     a[i] = i;
>>>>
>>>> At -O2, GCC generates the following PTX code:
>>>>
>>>>   {
>>>>     .reg.u32 %y;
>>>>     mov.u32 %y, %tid.y;
>>>>     setp.ne.u32 %r76, %y, 0;
>>>>   }
>>>>   {
>>>>     .reg.u32 %x;
>>>>     mov.u32 %x, %tid.x;
>>>>     setp.ne.u32 %r75, %x, 0;
>>>>   }
>>>>   @%r76 bra.uni $L6;
>>>>   @%r75 bra $L7;
>>>>   mov.u64 %r67, %ar0;
>>>>   // fork 2;
>>>>   cvta.shared.u64 %r74, __oacc_bcast;
>>>>   st.u64 [%r74], %r67;
>>>> $L7:
>>>> $L6:
>>>>   @%r75 bra $L5;
>>>>   // forked 2;
>>>>   bar.sync 0;
>>>>   cvta.shared.u64 %r73, __oacc_bcast;
>>>>   ld.u64 %r67, [%r73];
>>>>   mov.u32 %r62, %ntid.y;
>>>>   mov.u32 %r63, %tid.y;
>>>>   setp.gt.s32 %r68, %r63, 9;
>>>>   @%r68 bra $L2;
>>>>   mov.u32 %r55, %r63;
>>>>   cvt.s64.s32 %r69, %r62;
>>>>   shl.b64 %r59, %r69, 2;
>>>>   cvt.s64.s32 %r70, %r55;
>>>>   shl.b64 %r71, %r70, 2;
>>>>   add.u64 %r58, %r67, %r71;
>>>> $L3:
>>>>   st.u32 [%r58], %r55;
>>>>   add.u32 %r55, %r55, %r62;
>>>>   add.u64 %r58, %r58, %r59;
>>>>   setp.le.s32 %r72, %r55, 9;
>>>>   @%r72 bra $L3;
>>>> $L2:
>>>>   bar.sync 1;
>>>>   // joining 2;
>>>> $L5:
>>>>   // join 2;
>>>>   ret;
>>>>
>>>> Note the bar.sync instructions placed immediately after the forked
>>>> comment and before the joining comment. The problem here is that the
>>>> branch above the forked comment guarantees that the warps are not
>>>> synchronous (when vector_length > 1, which is always the case).
>>>
>>> This is already advised against in ptx 3.1, so yes, we should fix
>>> this.
>>>
>>>> Likewise, the bar.sync instruction before the joining comment should
>>>> be placed after label $L5 in order to allow all of the threads in
>>>> the warp to reach it.
>>>
>>> Agreed.
>>>
>>>> The attached patch teaches the nvptx backend to make those
>>>> adjustments.
>>>
>>> Can you show me a diff of the ptx for the test-case above for trunk?
>>
>> --- w-old.s 2018-03-08 15:19:47.139516578 -0800
>> +++ w.s 2018-03-09 08:42:52.217057332 -0800
>> @@ -46,9 +46,9 @@
>>   st.u64 [%r74], %r67;
>>  $L7:
>>  $L6:
>> - @%r75 bra $L5;
>>   // forked 2;
>>   bar.sync 0;
>> + @%r75 bra $L5;
>>   cvta.shared.u64 %r73, __oacc_bcast;
>>   ld.u64 %r67, [%r73];
>>   mov.u32 %r62, %ntid.y;
>> @@ -68,9 +68,9 @@
>>   setp.le.s32 %r72, %r55, 9;
>>   @%r72 bra $L3;
>>  $L2:
>> - bar.sync 1;
>>   // joining 2;
>>  $L5:
>> + bar.sync 1;
>>   // join 2;
>>   ret;
>>  }
>>
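
To spell out the invariant I'm trying to establish with that change
(just my mental model, written as a CUDA-ish sketch rather than what
the backend literally emits, with __syncthreads() standing in for
bar.sync):

  /* Hypothetical shape of a worker-partitioned region: the CTA-level
     barriers must sit outside the vector-neutered branch so that every
     thread reaches them; only the work in between is skipped by the
     neutered threads.  */
  __device__ void
  worker_region (int *a, int i)
  {
    __syncthreads ();       /* bar.sync 0: reached by all threads.     */

    if (threadIdx.x == 0)   /* vector-neutering: only lane 0 ...       */
      a[i] = i;             /* ... executes the loop body.             */

    __syncthreads ();       /* bar.sync 1: again reached by everyone,  */
                            /* i.e. placed after the join label.       */
  }
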
> At -O0, yes.
>
> At -O2, we have:
> ...
> diff -u -a 1 2
> --- 1 2018-03-19 14:13:44.074834552 +0100
> +++ 2 2018-03-19 14:15:06.075301168 +0100
> @@ -42,20 +42,20 @@
>   st.u64 [%r32],%r25;
>  $L7:
>  $L6:
> -@ %r33 bra $L5;
>   // forked 2;
>   bar.sync 0;
> +@ %r33 bra $L5;
>   cvta.shared.u64 %r31,__worker_bcast;
>   ld.u64 %r25,[%r31];
>   mov.u32 %r24,%tid.y;
>   setp.le.s32 %r26,%r24,9;
>   @ %r26 bra $L2;
>  $L3:
> -bar.sync 1;
>   // joining 2;
>  $L5:
> -@ %r34 bra.uni $L8;
> +bar.sync 1;
>   @ %r33 bra $L9;
> +@ %r34 bra.uni $L8;
>   // join 2;
>  $L9:
>  $L8:
> ...
>
> Note that this changes the ordering of the vector-neutering jump and
> the worker-neutering jump at the end. In principle, this should not be
> harmful, but it violates the invariant that vector-neutering
> branch-around code should be as short-lived as possible. So, this needs
> to be fixed.
>
> I've found this issue by adding verification of the neutering, as
> attached below.

ACK, thanks. I'll take a closer look at this. Is your patch purely for
debugging, or are you planning on committing it to og7 and trunk?

Cesar