https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85519
Bug ID: 85519 Summary: [nvptx, openacc, openmp, testsuite] Recursive tests may fail due to thread stack limit Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: vries at gcc dot gnu.org Target Milestone: --- I. Consider the following openacc program containing a recursive function: ... #pragma acc routine seq static int __attribute__((noinline, noclone)) rec (int n) { int res; int volatile x = 1; if (n <= 1) res = 0; else res = rec (n - 1) + x; return res; } int main (void) { int x[0]; #pragma acc parallel copyout(x) num_gangs(1) num_workers (1) x[0] = rec (65); return x[0]; } ... At -O2, we generate the following ptx for rec: ... .func (.param .u32 %value_out) rec (.param .u32 %in_ar0) { .reg .u32 %value; .reg .u32 %ar0; ld.param.u32 %ar0,[%in_ar0]; .local .align 8 .b8 %frame_ar[8]; .reg .u64 %frame; cvta.local.u64 %frame,%frame_ar; .reg .u32 %r24; .reg .u32 %r25; .reg .u32 %r26; .reg .u32 %r27; .reg .pred %r28; .reg .u32 %r30; .reg .u32 %r31; mov.u32 %r26,%ar0; mov.u32 %r27,1; st.u32 [%frame],%r27; setp.eq.u32 %r28,%r26,1; @ %r28 bra $L3; add.u32 %r30,%r26,-1; { .param .u32 %value_in; .param .u32 %out_arg1; st.param.u32 [%out_arg1],%r30; call (%value_in),rec,(%out_arg1); ld.param.u32 %r31,[%value_in]; } ld.u32 %r24,[%frame]; add.u32 %r25,%r31,%r24; bra $L1; $L3: mov.u32 %r25,0; $L1: mov.u32 %value,%r25; st.param.u32 [%value_out],%value; ret; } ... And for my quadro m1200, we get this SASS for rec (note the 16 byte stack decrement at the start): ... rec: /*0008*/ IADD32I R1, R1, -0x10; /*0010*/ { MOV R0, R4; /*0018*/ STL [R1+0x8], R2; } /*0028*/ LOP.OR R2, R1, c[0x0][0x4]; /*0030*/ LEA R4.CC, R2.reuse, RZ; /*0038*/ ISETP.EQ.U32.AND P0, PT, R0, 0x1, PT; /*0048*/ LEA.HI.X P1, R5, R2, RZ, RZ; /*0050*/ MOV32I R3, 0x1; /*0058*/ ST.E [R4], R3, P1; /*0068*/ @P0 BRA `(.L_1); /*0070*/ MOV R4, R0; /*0078*/ IADD32I R4, R4, -0x1; /*0088*/ CAL `(.text.rec); /*0090*/ LEA R6.CC, R2.reuse, RZ; /*0098*/ LEA.HI.X P0, R7, R2, RZ, RZ; /*00a8*/ LD.E R6, [R6], P0; /*00b0*/ { IADD R4, R6, R4; /*00b8*/ BRA `(.L_2); } .L_1: /*00c8*/ MOV R4, RZ; .L_2: /*00d0*/ LDL R2, [R1+0x8]; /*00d8*/ { IADD32I R1, R1, 0x10; /*00e8*/ RET; } ... With 65 calls to rec, we get a stack size of 1040, which apparently is close enough to the 1024 stack limit (found using cudaThreadGetLimit(&size, cudaLimitStackSize)) to run into trouble: ... nvptx_exec: kernel main$_omp_fn$0: finished libgomp: cuStreamSynchronize error: an illegal memory access was encountered ... At GOMP_NVPTX_JIT=-O0, we see instead: ... nvptx_exec: kernel main$_omp_fn$0: finished libgomp: cuStreamSynchronize error: an illegal instruction was encountered ... because rec starts with a valid stack address check: ... rec: /*0008*/ IADD32I R1, R1, -0x10; /*0010*/ S2R R0, SR_LMEMHIOFF; /*0018*/ ISETP.GE.U32.AND P0, PT, R1, R0, PT; /*0028*/ @P0 BRA `(.L_1); /*0030*/ BPT.TRAP 0x1; .L_1: ... II. We have recursive openacc and openmp testcases in the libgomp testsuite, and when testing on a new architecture we might run into this out-of-stack failure due to a smaller stack limit or larger frame size of the recursive function. F.i., I ran into these failures with the og7 branch on titan v (due to the latter): ... FAIL: libgomp.fortran/examples-4/declare_target-1.f90 -O1 execution test FAIL: libgomp.fortran/examples-4/declare_target-1.f90 -O2 execution test FAIL: libgomp.fortran/examples-4/declare_target-1.f90 -Os execution test FAIL: libgomp.fortran/examples-4/declare_target-2.f90 -O1 execution test FAIL: libgomp.fortran/examples-4/declare_target-2.f90 -O2 execution test FAIL: libgomp.fortran/examples-4/declare_target-2.f90 -Os execution test ... We need to deal with these test-case failures somehow: - we can reduce the recursion depth - conversely, we can try to call cudaThreadSetLimit (cudaLimitStackSize, n) in the testcase, and increase the limit - likewise, we can add support for a GOMP_NVPTX_THREADSTACKSIZE=<n> in the libgomp nvptx target plugin, and then use this variable in testcases to increase the limit All these solutions work until the next failure shows up. It would be nice to fix this more definitely in some way, but I'm not sure how. III. It's perhaps counter-intuitive that an nvptx openmp exec runs out of stack in a shallow test-case like declare_target-1.f90 (with recursion depth ~25). The nvptx openmp implementation uses a warp-specific stack in global memory, avoiding the use of .local in the generated ptx, and the size of the warp-specific stack is 128kb per warp. But function parameters may also end up on stack, making the thread stack still the limiting factor.