https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85519

            Bug ID: 85519
           Summary: [nvptx, openacc, openmp, testsuite] Recursive tests
                    may fail due to thread stack limit
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vries at gcc dot gnu.org
  Target Milestone: ---

I.

Consider the following openacc program containing a recursive function:
...
#pragma acc routine seq
static int __attribute__((noinline, noclone))
rec (int n)
{
  int res;
  int volatile x = 1;
  if (n <= 1)
    res = 0;
  else
    res = rec (n - 1) + x;
  return res;
}

int
main (void)
{
  int x[1];
  #pragma acc parallel copyout(x) num_gangs(1) num_workers (1)
  x[0] = rec (65);

  return x[0];
}
...


At -O2, we generate the following ptx for rec:
...
.func (.param .u32 %value_out) rec (.param .u32 %in_ar0)
{
  .reg .u32 %value;
  .reg .u32 %ar0;
  ld.param.u32 %ar0,[%in_ar0];
  .local .align 8 .b8 %frame_ar[8];
  .reg .u64 %frame;
  cvta.local.u64 %frame,%frame_ar;
  .reg .u32 %r24;
  .reg .u32 %r25;
  .reg .u32 %r26;
  .reg .u32 %r27;
  .reg .pred %r28;
  .reg .u32 %r30;
  .reg .u32 %r31;
  mov.u32 %r26,%ar0;
  mov.u32 %r27,1;
  st.u32 [%frame],%r27;
  setp.eq.u32 %r28,%r26,1;
  @ %r28 bra $L3;
  add.u32 %r30,%r26,-1;
  {
    .param .u32 %value_in;
    .param .u32 %out_arg1;
    st.param.u32 [%out_arg1],%r30;
    call (%value_in),rec,(%out_arg1);
    ld.param.u32 %r31,[%value_in];
  }
  ld.u32 %r24,[%frame];
  add.u32 %r25,%r31,%r24;
  bra $L1;
 $L3:
  mov.u32 %r25,0;
 $L1:
  mov.u32 %value,%r25;
  st.param.u32 [%value_out],%value;
  ret;
}
...

And for my Quadro M1200, we get this SASS for rec (note the 16-byte stack
decrement at the start):
...
rec:
/*0008*/                   IADD32I R1, R1, -0x10;
/*0010*/         {         MOV R0, R4;
/*0018*/                   STL [R1+0x8], R2;        }
/*0028*/                   LOP.OR R2, R1, c[0x0][0x4];
/*0030*/                   LEA R4.CC, R2.reuse, RZ;
/*0038*/                   ISETP.EQ.U32.AND P0, PT, R0, 0x1, PT;
/*0048*/                   LEA.HI.X P1, R5, R2, RZ, RZ;
/*0050*/                   MOV32I R3, 0x1;
/*0058*/                   ST.E [R4], R3, P1;
/*0068*/               @P0 BRA `(.L_1);
/*0070*/                   MOV R4, R0;
/*0078*/                   IADD32I R4, R4, -0x1;
/*0088*/                   CAL `(.text.rec);
/*0090*/                   LEA R6.CC, R2.reuse, RZ;
/*0098*/                   LEA.HI.X P0, R7, R2, RZ, RZ;
/*00a8*/                   LD.E R6, [R6], P0;
/*00b0*/         {         IADD R4, R6, R4;
/*00b8*/                   BRA `(.L_2);        }
.L_1:
/*00c8*/                   MOV R4, RZ;
.L_2:
/*00d0*/                   LDL R2, [R1+0x8];
/*00d8*/         {         IADD32I R1, R1, 0x10;
/*00e8*/                   RET;        }
...

With 65 calls to rec, we need 65 x 16 = 1040 bytes of stack, which exceeds the
1024-byte stack limit (found using cudaThreadGetLimit(&size,
cudaLimitStackSize)), so we run into trouble:
...
  nvptx_exec: kernel main$_omp_fn$0: finished

libgomp: cuStreamSynchronize error: an illegal memory access was encountered
...

At GOMP_NVPTX_JIT=-O0, we see instead:
...
  nvptx_exec: kernel main$_omp_fn$0: finished

libgomp: cuStreamSynchronize error: an illegal instruction was encountered
...
because, at -O0, rec starts with a stack-address validity check:
...
rec:
        /*0008*/                   IADD32I R1, R1, -0x10;
        /*0010*/                   S2R R0, SR_LMEMHIOFF;
        /*0018*/                   ISETP.GE.U32.AND P0, PT, R1, R0, PT;
        /*0028*/               @P0 BRA `(.L_1);
        /*0030*/                   BPT.TRAP 0x1;
.L_1:
...


II.

We have recursive openacc and openmp testcases in the libgomp testsuite, and
when testing on a new architecture we might run into this out-of-stack failure
due to a smaller stack limit or a larger frame size of the recursive function.

For instance, I ran into these failures with the og7 branch on a Titan V (due
to the latter cause, a larger frame size):
...
FAIL: libgomp.fortran/examples-4/declare_target-1.f90   -O1  execution test
FAIL: libgomp.fortran/examples-4/declare_target-1.f90   -O2  execution test
FAIL: libgomp.fortran/examples-4/declare_target-1.f90   -Os  execution test
FAIL: libgomp.fortran/examples-4/declare_target-2.f90   -O1  execution test
FAIL: libgomp.fortran/examples-4/declare_target-2.f90   -O2  execution test
FAIL: libgomp.fortran/examples-4/declare_target-2.f90   -Os  execution test
...

We need to deal with these test-case failures somehow:
- we can reduce the recursion depth
- alternatively, we can call cudaThreadSetLimit (cudaLimitStackSize, n) in the
  testcase to increase the limit
- likewise, we can add support for a GOMP_NVPTX_THREADSTACKSIZE=<n> environment
  variable in the libgomp nvptx target plugin, and then use this variable in
  testcases to increase the limit

All these solutions work only until the next failure shows up. It would be nice
to fix this more definitively in some way, but I'm not sure how.


III.

It's perhaps counter-intuitive that an nvptx openmp executable runs out of
stack in a shallow test-case like declare_target-1.f90 (with recursion depth
~25). The nvptx openmp implementation uses a warp-specific stack in global
memory, avoiding the use of .local in the generated ptx, and that
warp-specific stack is 128KB per warp. But function parameters may still end
up on the thread stack, making the thread stack the limiting factor.
