On 03/09/2018 12:31 AM, Cesar Philippidis wrote:
Nvidia Volta GPUs now support warp-level synchronization.

Well, let's try to make that statement a bit more precise.

All Nvidia architectures have supported synchronization of threads in a warp on a very basic level: by means of convergence (and unfortunately, we've seen that this is very error-prone).

What is new in PTX 6.0 combined with sm_70 is the ability to sync divergent threads without having to converge, for instance by using the new instructions bar.warp.sync and barrier.sync.
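To illustrate (a hand-written sketch, not complete PTX and not compiler output; assuming an sm_70 target, where the participating threads no longer have to execute the same bar.warp.sync instruction in convergence): the two halves of a warp take different branches and still synchronize with each other, without reconverging first:

         .reg .pred     %p1;
         .reg .u32      %r1;
                 mov.u32         %r1, %laneid;
                 setp.ge.u32     %p1, %r1, 16;
         @%p1    bra     $L_hi;
         // lanes 0..15: one side of the divergent region
                 bar.warp.sync   0xffffffff;     // all 32 lanes named in the member mask
                 bra     $L_done;
$L_hi:
         // lanes 16..31: the other side
                 bar.warp.sync   0xffffffff;
$L_done: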

As such, the
semantics of legacy bar.sync instructions have slightly changed on newer
GPUs.

Before, in PTX 3.1, we have for bar.sync:
...
Barriers are executed on a per-warp basis as if all the threads in a warp are active. Thus, if any thread in a warp executes a bar instruction, it is as if all the threads in the warp have executed the bar instruction. All threads in the warp are stalled until the barrier completes, and the arrival count for the barrier is incremented by the warp size (not the number of active threads in the warp). In conditionally executed code, a bar instruction should only be used if it is known that all threads evaluate the condition identically (the warp does not diverge).
...

But in PTX 6.0, we have:
...
bar.sync is equivalent to barrier.sync.aligned
...
and:
...
Instruction barrier has optional .aligned modifier. When specified, it indicates that all threads in CTA will execute the same barrier instruction. In conditionally executed code, an aligned barrier instruction should only be used if it is known that all threads in
CTA evaluate the condition identically, otherwise behavior is undefined.
...

So, in PTX 3.1 bar.sync should be executed in convergent mode (all the threads in each warp executing the same code). But in PTX 6.0, bar.sync should be executed with the whole CTA executing the same code.

So, going by the PTX descriptions, it seems indeed that the semantics of bar.sync have changed. That is however surprising, since it would break the forward compatibility that AFAIU is the idea behind PTX.

So for now my hope is that this is a documentation error.
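To make the apparent difference concrete, here is a made-up sketch (not from any test case, and assuming %ntid.x is 32 so that %tid.y is uniform within each warp) of the kind of conditional bar.sync whose status seems to change between the two texts:

         .reg .pred     %p1;
         .reg .u32      %r1;
                 mov.u32         %r1, %tid.y;    // uniform within a warp when %ntid.x == 32
                 setp.gt.u32     %p1, %r1, 1;
         @%p1    bra     $L_skip;
                 bar.sync        1, 64;          // only the two %tid.y < 2 warps arrive
$L_skip:

Every warp evaluates the condition identically, so no warp diverges, which is what the PTX 3.1 text asks for. But not all threads in the CTA execute the barrier instruction, which the PTX 6.0 .aligned wording appears to require.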

The PTX JIT will now, occasionally, emit a warpsync instruction
immediately before a bar.sync for Volta GPUs. That implies that warps
must be convergent on entry to those thread barriers.


That warps must be convergent on entry to bar.sync is already required by PTX 3.1.

[ And bar.warp.sync does not force convergence, so if the warpsync instruction you mention is equivalent to bar.warp.sync then your reasoning is incorrect. ]

The problem in og7, and trunk, is that GCC emits barrier instructions at
the wrong spots. E.g., consider the following OpenACC parallel region:

   #pragma acc parallel loop worker
   for (i = 0; i < 10; i++)
     a[i] = i;

At -O2, GCC generates the following PTX code:

         {
                 .reg.u32        %y;
                 mov.u32 %y, %tid.y;
                 setp.ne.u32     %r76, %y, 0;
         }
         {
                 .reg.u32        %x;
                 mov.u32 %x, %tid.x;
                 setp.ne.u32     %r75, %x, 0;
         }
         @%r76   bra.uni $L6;
         @%r75   bra     $L7;
                 mov.u64 %r67, %ar0;
         // fork 2;
                 cvta.shared.u64 %r74, __oacc_bcast;
                 st.u64  [%r74], %r67;
$L7:
$L6:
         @%r75   bra     $L5;
         // forked 2;
                 bar.sync        0;
                 cvta.shared.u64 %r73, __oacc_bcast;
                 ld.u64  %r67, [%r73];
                 mov.u32 %r62, %ntid.y;
                 mov.u32 %r63, %tid.y;
                 setp.gt.s32     %r68, %r63, 9;
         @%r68   bra     $L2;
                 mov.u32 %r55, %r63;
                 cvt.s64.s32     %r69, %r62;
                 shl.b64 %r59, %r69, 2;
                 cvt.s64.s32     %r70, %r55;
                 shl.b64 %r71, %r70, 2;
                 add.u64 %r58, %r67, %r71;
$L3:
                 st.u32  [%r58], %r55;
                 add.u32 %r55, %r55, %r62;
                 add.u64 %r58, %r58, %r59;
                 setp.le.s32     %r72, %r55, 9;
         @%r72   bra     $L3;
$L2:
                 bar.sync        1;
         // joining 2;
$L5:
         // join 2;
         ret;

Note the bar.sync instructions placed immediately after the forked
comment and before the joining comment. The problem here is that the branch
above the forked comment guarantees that the warps are not synchronous
(when vector_length > 1, which is always the case).

This is already advised against in PTX 3.1, so yes, we should fix this.

Likewise, the bar.sync instruction before joining should be placed after label $L5 in order to allow all of the threads in the warp to reach it.


Agreed.
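For concreteness, this is my reading of the intended placement for the fragment above (hand-edited, not actual compiler output): the vector neutering branch moves below the first bar.sync, and the second bar.sync moves below $L5:

$L7:
$L6:
         // forked 2;
                 bar.sync        0;              // now executed by all threads in the CTA
         @%r75   bra     $L5;                    // neutering branch moved below the barrier
                 ...                             // worker loop unchanged
$L2:
         // joining 2;
$L5:
                 bar.sync        1;              // after $L5, so the neutered threads reach it
         // join 2;
         ret;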

The attached patch teaches the nvptx backend to make those adjustments.

Can you show me a diff of the PTX for the test case above for trunk?

It
doesn't cause any regressions on legacy GPUs, but it does resolve quite
a few failures with Volta in the libgomp execution tests.

So, did you test this on trunk?

Therefore,
this patch doesn't include any new test cases.

Makes sense.

Part of this patch came from my vector_length patch set that I posted last week. However, that patch set didn't consider the placement of the joining barrier.

I've applied this patch to openacc-gcc-7-branch.

Tom, is a similar patch OK for trunk? The major difference between trunk
and og7 is that og7 changed the name of nvptx_warp_sync to nvptx_cta_sync.


Please, if you want to have a patch accepted for trunk, then just submit a trunk patch.

Cesar


og7-barriers.diff


2018-03-08  Cesar Philippidis  <ce...@codesourcery.com>

        gcc/
        * config/nvptx/nvptx.c (nvptx_single): Adjust placement of nvptx_fork
        and nvptx_join neutering labels.
        (nvptx_process_pars): Place the CTA barrier at the beginning of the
        join block.


diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
index b16cf59575c..efc6161a6b0 100644
--- a/gcc/config/nvptx/nvptx.c
+++ b/gcc/config/nvptx/nvptx.c
@@ -4056,6 +4056,15 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
        return;
      }
+  /* NVPTX_BARSYNC barriers are placed immediately before NVPTX_JOIN
+     in order to ensure that all of the threads in a CTA reach the
+     barrier.  Don't neuter BLOCK if head is NVPTX_BARSYNC and tail is
+     NVPTX_JOIN.  */
+  if (from == to
+      && recog_memoized (head) == CODE_FOR_nvptx_barsync
+      && recog_memoized (tail) == CODE_FOR_nvptx_join)
+    return;
+
    /* Insert the vector test inside the worker test.  */
    unsigned mode;
    rtx_insn *before = tail;
@@ -4103,7 +4112,17 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
          br = gen_br_true (pred, label);
        else
          br = gen_br_true_uni (pred, label);
-       emit_insn_before (br, head);
+
+       if (recog_memoized (head) == CODE_FOR_nvptx_forked
+           && recog_memoized (NEXT_INSN (head)) == CODE_FOR_nvptx_barsync)
+         {
+           head = NEXT_INSN (head);
+           emit_insn_after (br, head);
+         }
+       else if (recog_memoized (head) == CODE_FOR_nvptx_barsync)
+         emit_insn_after (br, head);
+       else
+         emit_insn_before (br, head);
LABEL_NUSES (label)++;
        if (tail_branch)
@@ -4325,7 +4344,7 @@ nvptx_process_pars (parallel *par)
        {
          /* Insert begin and end synchronizations.  */
          emit_insn_after (nvptx_cta_sync (false), par->forked_insn);
-         emit_insn_before (nvptx_cta_sync (true), par->joining_insn);
+         emit_insn_before (nvptx_cta_sync (true), par->join_insn);
        }
      }
    else if (par->mask & GOMP_DIM_MASK (GOMP_DIM_VECTOR))


Hmm, the patch looks a bit fragile to me.

I wonder if it's possible to do something similar to https://gcc.gnu.org/bugzilla/attachment.cgi?id=43480&action=diff

Thanks,
- Tom
