https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81069
--- Comment #1 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
Created attachment 41548
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41548&action=edit
nested-function-1_.f90

Note that I'm not saying that the problem is actually caused by trunk r239357;
quite likely it's latent, and has rather just been uncovered by some code
generation changes due to that commit.

I have done most of the following analysis on a merge of trunk r239357 into
gomp-4-branch r240825.  However, as I was to notice later, the problem
likewise appears with/is visible on trunk r239357, if you explicitly enable
worker/vector parallelism usage:

--- libgomp/testsuite/libgomp.oacc-fortran/nested-function-1.f90
+++ libgomp/testsuite/libgomp.oacc-fortran/nested-function-1.f90
@@ -38,7 +38,7 @@ firstdo: do i = 1, 3
       !     execution mode into something like gang-single.
       !$acc loop gang(static:1) collapse(1)
       do 115 k=1,3
-        !$acc loop collapse(2)
+        !$acc loop worker vector collapse(2)
 dokk:   do kk=1,3
           do kkk=1,3
             a(k,kk,kkk) = 1
@@ -53,7 +53,7 @@ firstdo: do i = 1, 3
       !     execution mode into something like gang-single.
       !$acc loop gang(static:1) collapse(1)
 dol:  do 120 l=1,3
-        !$acc loop collapse(2)
+        !$acc loop worker vector collapse(2)
 doll:   do ll=1,3
           do lll=1,3
             a(l,ll,lll) = 2

So, it should be reasonable to further analyze this on a current GCC trunk
revision.

In the following I'm referring to a simplified test case,
"nested-function-1_.f90", as attached, which still shows the execution failure
with "-O1 -fopenacc".  I use "-foffload=nvptx-none=-save-temps", and then
examine the smaller of the two "cc*.s" files generated.  The problem seems to
be rather fragile; for example, after translating the Fortran "any" intrinsic
into corresponding Fortran loop code, the problem is no longer visible (a
sketch of such a rewrite follows the PTX diff below).  Note that per the patch
above, this region of code needs to be gang-parallel, but worker-single (W0
active), vector-single (V0 active).  I did not yet try to translate this code
into C.

Editing out any insignificant code changes, only the following diff remains in
the PTX code generated without/with trunk r239357:

--- [without trunk r239357]
+++ [with trunk r239357]
@@ -286,38 +286,36 @@
 $L22:
 $L21:
         bar.sync 0;
         ld.shared.u32 %r169, [__worker_bcast];
         setp.ne.u32 %r151, %r169, 0;
         bar.sync 1;
         @!%r151 bra.uni $L16;
-        bra $L6;
+$L6:
+        @%r162 bra.uni $L25;
+        @%r163 bra $L26;
+        {
+                call _gfortran_abort;
+                trap; // (noreturn)
+        }
+$L26:
+$L25:
 $L8:
         add.u32 %R7, %R7, %r87;
         add.u32 %r152, %R7, -1;
         setp.le.s32 %r154, %r152, %r158;
         selp.u32 %r164, 1, 0, %r154;
         st.shared.u32 [__worker_bcast], %r164;
 $L18:
 $L17:
         bar.sync 0;
         ld.shared.u32 %r165, [__worker_bcast];
         setp.ne.u32 %r154, %r165, 0;
         bar.sync 1;
         @%r154 bra.uni $L11;
-        bra $L1;
-$L6:
-        @%r162 bra.uni $L25;
-        @%r163 bra $L26;
-        {
-                call _gfortran_abort;
-                trap; // (noreturn)
-        }
-$L26:
-$L25:
 $L1:
         ret;
 }

While the two versions are still very similar, there is one difference in the
trunk r239357 code: the "abort" call is neutered/skipped for !W0 (%r162) and
!V0 (%r163), but these "follow-along" threads then run into the
"__worker_bcast" "st.shared", which to my understanding is supposed to be
executed by W0, V0 only.  I have not verified it explicitly, but this must be
where the problem is, given that it is the only remaining difference in the
generated code.
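For reference, the kind of "any"-to-loops rewrite mentioned above, as a
minimal sketch only; the actual array name, shape, and checked value in
"nested-function-1_.f90" are assumptions based on the assignments visible in
the test case diff:

    ! Instead of checking the whole array with the "any" intrinsic ...
    !   if (any (a(1:3,1:3,1:3) /= 1)) call abort
    ! ... check element-wise with explicit loops:
    do k = 1, 3
       do kk = 1, 3
          do kkk = 1, 3
             if (a(k, kk, kkk) /= 1) call abort
          end do
       end do
    end do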
With trunk r239357, the problem disappears when disabling the "Neuter whole
SESE regions" optimization:

--- gcc/config/nvptx/nvptx.c
+++ gcc/config/nvptx/nvptx.c
@@ -3719,7 +3719,7 @@ nvptx_neuter_pars (parallel *par, unsigned modes, unsigned outer)
 {
   int ix, len;

-  if (nvptx_optimize)
+  if (0 && nvptx_optimize)
     {
       /* Neuter whole SESE regions.  */
       bb_pair_vec_t regions;

Editing out any insignificant code changes, only the following diff remains in
the PTX code generated with that optimization disabled/enabled:

--- ["Neuter whole SESE regions" disabled]
+++ ["Neuter whole SESE regions" enabled (default)]
@@ -253,20 +253,22 @@
         setp.eq.u64 %r145, %r99, %r155;
         selp.u32 %r166, 1, 0, %r145;
         st.shared.u32 [__worker_bcast], %r166;
 $L20:
 $L19:
         bar.sync 0;
         ld.shared.u32 %r167, [__worker_bcast];
         setp.ne.u32 %r145, %r167, 0;
         bar.sync 1;
         @!%r145 bra.uni $L9;
+        @%r162 bra.uni $L17;
+        @%r163 bra $L18;
         bra $L8;
 $L15:
         @%r162 bra.uni $L23;
         @%r163 bra $L24;
         ld.u32 %r146, [%r93+12];
         setp.ne.u32 %r148, %r146, %r142;
         selp.u32 %r170, 1, 0, %r148;
         st.shared.u32 [__worker_bcast], %r170;
 $L24:
 $L23:
@@ -291,22 +293,20 @@
 $L6:
         @%r162 bra.uni $L25;
         @%r163 bra $L26;
         {
                 call _gfortran_abort;
                 trap; // (noreturn)
         }
 $L26:
 $L25:
 $L8:
-        @%r162 bra.uni $L17;
-        @%r163 bra $L18;
         add.u32 %r56, %r56, %r87;
         add.u32 %r152, %r56, -1;
         setp.le.s32 %r154, %r152, %r158;
         selp.u32 %r164, 1, 0, %r154;
         st.shared.u32 [__worker_bcast], %r164;
 $L18:
 $L17:
         bar.sync 0;
         ld.shared.u32 %r165, [__worker_bcast];
         setp.ne.u32 %r154, %r165, 0;

As above, with "Neuter whole SESE regions" enabled (default), the
"__worker_bcast" "st.shared" is no longer being neutered/skipped for the !W0,
!V0 "follow-along" threads.  Again, even though I have not explicitly verified
that, this must be where the problem is, given that it is then the only
difference in the generated code.

The "Neuter whole SESE regions" code is an optimization, discussed in
<http://mid.mail-archive.com/55F34C51.4090908@acm.org> (gomp-4_0-branch), and
<http://mid.mail-archive.com/564CC75D.3020309@acm.org> (trunk).

With trunk r239357, the problem also disappears when, instead of "call abort"
(which is known to be "noreturn"), something like "call backtrace" (not
"noreturn") is used; see the sketch at the end of this comment.  (That won't
link, due to the PTX "unresolved symbol _gfortran_backtrace" error, but it is
enough to examine the generated code.)  The generated PTX code is
substantially different.  I did notice that in the "abort"/"noreturn" case, we
run into the "node leads [...] to a no-return region" case in
gcc/config/nvptx/nvptx.c:nvptx_sese_pseudo, but I have not examined that any
further.

Is something going wrong in the "Neuter whole SESE regions" optimization,
possibly in combination with "noreturn" function calls?
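For completeness, the substitution used for the "call backtrace" experiment
mentioned above, again only a sketch; where exactly the abort calls sit in the
simplified test case is an assumption:

    ! Original check; "abort" is a noreturn call:
    if (a(k, kk, kkk) /= 1) call abort
    ! Experiment: use a non-noreturn call instead.  This does not link
    ! ("unresolved symbol _gfortran_backtrace"), but is sufficient for
    ! examining the generated PTX code:
    if (a(k, kk, kkk) /= 1) call backtrace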