https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85584

--- Comment #1 from Tom de Vries <vries at gcc dot gnu.org> ---
A quick way to try this out is to remove the add.u64 %y64,%y64,1 in the
calculation of %r109, but then the test-case fails with:
...
libgomp: cuStreamSynchronize error: an illegal memory access was encountered
...

Things go wrong here at the start:
...
  @ %r119 bra.uni $L12;
  @ %r114 bra $L13;
  mov.u64 %r84,%ar0;
  mov.u64 %r85,%ar1;
  // fork 2;                                                                    
  cvta.shared.u64 %r113,__oacc_bcast;
  st.u64 [%r113],%r84;
  st.u64 [%r113+8],%r85;
 $L13:
 $L12:

  bar.sync 0;

  @ %r114 bra $L11;
  // forked 2;                                                                  
  cvta.shared.u64 %r112,__oacc_bcast;
  ld.u64 %r84,[%r112];
  ld.u64 %r85,[%r112+8];
  mov.u32 %r75,%ntid.y;
  mov.u32 %r76,%tid.y;
  setp.gt.s32 %r86,%r76,31;
  selp.u32 %r117,1,0,%r86;
  st.u32 [%r109],%r117;
 $L11:
...

- First we broadcast to the generic partition (%r113 stores)
- Then we read the broadcast from the generic partition (%r112 loads)
- Then we broadcast to the per-worker partition (%r109 store)

The problem is that there's nothing to guarantee that the read of the generic
broadcast in worker 1 is done before the broadcast to the per-worker partition
in worker 0 overwrites it.
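
To make the ordering problem concrete, here is a toy model in Python. It is
only a sketch and not nvptx backend code: the step names, the single-slot
"shared memory" and the string values are all made up for illustration, and
it assumes (as in the experiment above) that worker 0's partition aliases the
generic partition. It enumerates the orderings that the barriers allow and
collects what worker 1 can end up loading into %r84.

```python
from itertools import permutations

GENERIC = 0  # model offset of the generic partition
WORKER0 = 0  # assumption: worker 0's partition aliases the generic one

STEPS = ["w0_store_generic", "w1_load_generic", "w0_store_worker"]


def replay(order):
    """Run the steps in the given order; return what worker 1 loaded."""
    shared = {}
    seen = {}
    for step in order:
        if step == "w0_store_generic":   # models st.u64 [%r113],%r84
            shared[GENERIC] = "ar0"
        elif step == "w1_load_generic":  # models ld.u64 %r84,[%r112]
            seen["r84"] = shared[GENERIC]
        elif step == "w0_store_worker":  # models st.u32 [%r109],%r117
            shared[WORKER0] = "flag"
    return seen["r84"]


def outcomes(extra_barrier):
    """Set of values worker 1 can observe under the allowed orderings."""
    results = set()
    for order in permutations(STEPS):
        pos = {step: i for i, step in enumerate(order)}
        # The existing bar.sync 0 puts the generic store before the rest.
        if pos["w0_store_generic"] != 0:
            continue
        # A second barrier would force the load before the worker store.
        if extra_barrier and pos["w1_load_generic"] > pos["w0_store_worker"]:
            continue
        results.add(replay(order))
    return results


print(sorted(outcomes(extra_barrier=False)))  # ['ar0', 'flag']
print(sorted(outcomes(extra_barrier=True)))   # ['ar0']
```

Without the second barrier the model allows worker 1 to load "flag" instead of
"ar0", which corresponds to worker 1 continuing with a clobbered pointer. The
model ignores store widths (the real store is a u32 into a u64 slot) and the
second broadcast value %r85.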

By inserting a barrier in between, the test-case passes:
...
  @ %r114 bra $L11a;
  // forked 2;                                                                  
  cvta.shared.u64 %r112,__oacc_bcast;
  ld.u64 %r84,[%r112];
  ld.u64 %r85,[%r112+8];
 $L11a:

  bar.sync 0;

  @ %r114 bra $L11b;
  mov.u32 %r75,%ntid.y;
  mov.u32 %r76,%tid.y;
  setp.gt.s32 %r86,%r76,31;
  selp.u32 %r117,1,0,%r86;
  st.u32 [%r109],%r117;
 $L11b:
...
