https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85584
--- Comment #1 from Tom de Vries <vries at gcc dot gnu.org> ---
A quick way to try this out is to remove the add.u64 %y64,%y64,1 in the
calculation of %r109, but we run into:
...
libgomp: cuStreamSynchronize error: an illegal memory access was encountered
...

Things go wrong here at the start:
...
        @ %r119 bra.uni $L12;
        @ %r114 bra     $L13;
                mov.u64 %r84,%ar0;
                mov.u64 %r85,%ar1;
        // fork 2;
                cvta.shared.u64 %r113,__oacc_bcast;
                st.u64  [%r113],%r84;
                st.u64  [%r113+8],%r85;
$L13:
$L12:
                bar.sync        0;
        @ %r114 bra     $L11;
        // forked 2;
                cvta.shared.u64 %r112,__oacc_bcast;
                ld.u64  %r84,[%r112];
                ld.u64  %r85,[%r112+8];
                mov.u32 %r75,%ntid.y;
                mov.u32 %r76,%tid.y;
                setp.gt.s32     %r86,%r76,31;
                selp.u32        %r117,1,0,%r86;
                st.u32  [%r109],%r117;
$L11:
...

- First we broadcast to the generic partition (%r113 stores)
- Then we read the broadcast from the generic partition (%r112 loads)
- Then we broadcast to the per-worker partition (%r109 store)

The problem is that there's nothing to guarantee that the read of the generic
broadcast in worker 1 is done before the broadcast to the per-worker partition
in worker 0 overwrites it.

By inserting a barrier in between, the test-case passes:
...
        @ %r114 bra     $L11a;
        // forked 2;
                cvta.shared.u64 %r112,__oacc_bcast;
                ld.u64  %r84,[%r112];
                ld.u64  %r85,[%r112+8];
$L11a:
                bar.sync        0;
        @ %r114 bra     $L11b;
                mov.u32 %r75,%ntid.y;
                mov.u32 %r76,%tid.y;
                setp.gt.s32     %r86,%r76,31;
                selp.u32        %r117,1,0,%r86;
                st.u32  [%r109],%r117;
$L11b:
...
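
To make the ordering problem easier to follow outside PTX, here is a rough host-side model in Python. It is only a sketch under assumptions, not the compiler's actual behaviour: the two-element list stands in for __oacc_bcast, worker N's per-worker slot is assumed to alias element N of that same buffer (which is what dropping the offset from %r109 appears to cause), and BROADCAST, MARKER, racy_interleaving and run_with_barrier are made-up names. The first function replays one schedule that the original code permits; the second uses threading.Barrier in place of the two bar.sync 0 instructions.

```python
import threading

NUM_WORKERS = 2
BROADCAST = (0x1000, 0x2000)  # stand-ins for the values of %ar0 and %ar1
MARKER = -1                   # stand-in for the value stored through %r109


def racy_interleaving():
    """Replay one schedule the original code allows (no second barrier)."""
    shared = [0, 0]                    # models __oacc_bcast (assumed layout)
    seen = [None] * NUM_WORKERS
    shared[0], shared[1] = BROADCAST   # worker 0: generic broadcast stores
    # The existing bar.sync 0 sits here; nothing orders the steps below.
    seen[0] = (shared[0], shared[1])   # worker 0: reads the generic broadcast
    shared[0] = MARKER                 # worker 0: per-worker store runs ahead
    seen[1] = (shared[0], shared[1])   # worker 1: reads after the overwrite
    return seen


def run_with_barrier():
    """Same steps on real threads, with the extra barrier before the store."""
    shared = [0, 0]
    seen = [None] * NUM_WORKERS
    barrier = threading.Barrier(NUM_WORKERS)

    def worker(wid):
        if wid == 0:
            shared[0], shared[1] = BROADCAST   # generic broadcast stores
        barrier.wait()                         # existing bar.sync 0
        seen[wid] = (shared[0], shared[1])     # generic broadcast loads
        barrier.wait()                         # the inserted bar.sync 0
        shared[wid] = MARKER                   # per-worker broadcast store

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

In this model racy_interleaving() leaves worker 1 with a clobbered first word, while run_with_barrier() gives every worker the broadcast pair, because no per-worker store can start until all generic loads have completed.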