https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81069

--- Comment #1 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
Created attachment 41548
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41548&action=edit
nested-function-1_.f90

Note that I'm not saying that the problem is actually caused by trunk r239357;
quite likely it's latent, and was merely uncovered by code generation changes
due to that commit.

I have done most of the following analysis on a merge of trunk r239357 into
gomp-4_0-branch r240825.  However, as I noticed later, the problem likewise
appears on plain trunk r239357, if you explicitly enable worker/vector
parallelism usage:

    --- libgomp/testsuite/libgomp.oacc-fortran/nested-function-1.f90
    +++ libgomp/testsuite/libgomp.oacc-fortran/nested-function-1.f90
    @@ -38,7 +38,7 @@ firstdo: do i = 1, 3
         ! execution mode into something like gang-single.
         !$acc loop gang(static:1) collapse(1)
           do 115 k=1,3
    -         !$acc loop collapse(2)
    +         !$acc loop worker vector collapse(2)
       dokk: do kk=1,3
               do kkk=1,3
                 a(k,kk,kkk) = 1
    @@ -53,7 +53,7 @@ firstdo: do i = 1, 3
         ! execution mode into something like gang-single.
         !$acc loop gang(static:1) collapse(1)
      dol: do 120 l=1,3
    -    !$acc loop collapse(2)
    +    !$acc loop worker vector collapse(2)
       doll: do ll=1,3
               do lll=1,3
                 a(l,ll,lll) = 2

So, it should be reasonable to further analyze this on a current GCC trunk
revision.

In the following I'm referring to a simplified test case,
"nested-function-1_.f90" (attached), which still shows the execution failure
with "-O1 -fopenacc".  I use "-foffload=nvptx-none=-save-temps", and then
examine the smaller of the two "cc*.s" files generated.

The problem seems to be rather fragile; for example, after translating the
Fortran "any" intrinsic into equivalent Fortran loop code, the problem is no
longer visible.  Note that per the patch above, this region of code needs to be
gang-parallel, but worker-single (only W0 active) and vector-single (only V0
active).

I did not yet try to translate this code into C.


Editing out any insignificant code changes, only the following diff remains in
the PTX code generated without/with trunk r239357:

    --- [without trunk r239357]
    +++ [with trunk r239357]
    @@ -286,38 +286,36 @@
     $L22:
     $L21:
                    bar.sync        0;
                    ld.shared.u32   %r169, [__worker_bcast];
                    setp.ne.u32     %r151, %r169, 0;
                    bar.sync        1;
            @!%r151 bra.uni $L16;
    -               bra     $L6;
    +$L6:
    +       @%r162  bra.uni $L25;
    +       @%r163  bra     $L26;
    +       {
    +               call _gfortran_abort;
    +               trap; // (noreturn)
    +       }
    +$L26:
    +$L25:
     $L8:
                    add.u32 %R7, %R7, %r87;
                    add.u32 %r152, %R7, -1;
                    setp.le.s32     %r154, %r152, %r158;
                    selp.u32        %r164, 1, 0, %r154;
                    st.shared.u32   [__worker_bcast], %r164;
     $L18:
     $L17:
                    bar.sync        0;
                    ld.shared.u32   %r165, [__worker_bcast];
                    setp.ne.u32     %r154, %r165, 0;
                    bar.sync        1;
            @%r154  bra.uni $L11;
    -               bra     $L1;
    -$L6:
    -       @%r162  bra.uni $L25;
    -       @%r163  bra     $L26;
    -       {
    -               call _gfortran_abort;
    -               trap; // (noreturn)
    -       }
    -$L26:
    -$L25:
     $L1:
            ret;
     }

While the two versions are still very similar, there is one difference in the
trunk r239357 code: the "abort" call is neutered/skipped for !W0 (%r162) and
!V0 (%r163), but these "follow-along" threads then run into the
"__worker_bcast st.shared", which to my understanding is supposed to be
executed by W0/V0 only.  I have not verified it explicitly, but this must be
where the problem is, given that it is the only remaining difference in the
generated code.


With trunk r239357, the problem disappears when disabling the "Neuter whole
SESE regions" optimization:

    --- gcc/config/nvptx/nvptx.c
    +++ gcc/config/nvptx/nvptx.c
    @@ -3719,7 +3719,7 @@ nvptx_neuter_pars (parallel *par, unsigned modes, unsigned outer)
         {
           int ix, len;

    -      if (nvptx_optimize)
    +      if (0 && nvptx_optimize)
            {
              /* Neuter whole SESE regions.  */
              bb_pair_vec_t regions;

Editing out any insignificant code changes, only the following diff remains in
the PTX code generated with that optimization disabled/enabled:

    --- ["Neuter whole SESE regions" disabled]
    +++ ["Neuter whole SESE regions" enabled (default)]
    @@ -253,20 +253,22 @@
                    setp.eq.u64     %r145, %r99, %r155;
                    selp.u32        %r166, 1, 0, %r145;
                    st.shared.u32   [__worker_bcast], %r166;
     $L20:
     $L19:
                    bar.sync        0;
                    ld.shared.u32   %r167, [__worker_bcast];
                    setp.ne.u32     %r145, %r167, 0;
                    bar.sync        1;
            @!%r145 bra.uni $L9;
    +       @%r162  bra.uni $L17;
    +       @%r163  bra     $L18;
                    bra     $L8;
     $L15:
            @%r162  bra.uni $L23;
            @%r163  bra     $L24;
                    ld.u32  %r146, [%r93+12];
                    setp.ne.u32     %r148, %r146, %r142;
                    selp.u32        %r170, 1, 0, %r148;
                    st.shared.u32   [__worker_bcast], %r170;
     $L24:
     $L23:
    @@ -291,22 +293,20 @@
     $L6:
            @%r162  bra.uni $L25;
            @%r163  bra     $L26;
            {
                    call _gfortran_abort;
                    trap; // (noreturn)
            }
     $L26:
     $L25:
     $L8:
    -       @%r162  bra.uni $L17;
    -       @%r163  bra     $L18;
                    add.u32 %r56, %r56, %r87;
                    add.u32 %r152, %r56, -1;
                    setp.le.s32     %r154, %r152, %r158;
                    selp.u32        %r164, 1, 0, %r154;
                    st.shared.u32   [__worker_bcast], %r164;
     $L18:
     $L17:
                    bar.sync        0;
                    ld.shared.u32   %r165, [__worker_bcast];
                    setp.ne.u32     %r154, %r165, 0;

As above, with "Neuter whole SESE regions" enabled (the default), the
"__worker_bcast st.shared" is no longer neutered/skipped for the !W0, !V0
"follow-along" threads.  Again, even though I have not explicitly verified
that, this must be where the problem is, given that it is then the only
difference in the generated code.

The "Neuter whole SESE regions" code is an optimization, discussed in
<http://mid.mail-archive.com/55F34C51.4090908@acm.org> (gomp-4_0-branch), or
<http://mid.mail-archive.com/564CC75D.3020309@acm.org> (trunk).


With trunk r239357, the problem also disappears when "call abort" (which is
known to be "noreturn") is replaced with something like "call backtrace" (not
"noreturn").  (That won't link, due to the PTX "unresolved symbol
_gfortran_backtrace" error, but it is enough to examine the generated code.)
The generated PTX code is then substantially different.

I did notice that in the "abort"/"noreturn" case, we run into the "node leads
[...] to a no-return region" case in
gcc/config/nvptx/nvptx.c:nvptx_sese_pseudo, but I have not examined that any
further.


Is something going wrong in the "Neuter whole SESE regions" optimization,
possibly in combination with "noreturn" function calls?
