https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104916
--- Comment #1 from Tom de Vries <vries at gcc dot gnu.org> ---
We could try the same solution as for atomics: predicate each ld/st to execute only in lane 0, and propagate the ld result to the other lanes.

Another solution might be to wrap each ld/st in two bar.warp.sync instructions.
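
To make the two options concrete, here is a rough CUDA-level sketch of what each could look like. It is not the code the nvptx backend would emit and not a proposed patch. The helper names (load_lane0_broadcast, load_wrapped, demo, run_demo) are hypothetical, and the sketch assumes a 1-D block consisting of one full warp of 32 threads. __shfl_sync stands in for "propagate ld result", and __syncwarp() is the CUDA intrinsic that corresponds to bar.warp.sync.

```cuda
#include <cuda_runtime.h>

// Option 1 (assumed shape): only lane 0 performs the load, then the
// loaded value is broadcast to every lane of the warp.
__device__ int load_lane0_broadcast(const int *p)
{
    const unsigned full_mask = 0xffffffffu;  // assumes all 32 lanes are active
    int v = 0;
    if ((threadIdx.x & 31) == 0)             // predicate the ld on lane 0
        v = *p;
    return __shfl_sync(full_mask, v, 0);     // propagate lane 0's result
}

// Option 2 (assumed shape): every lane performs the load, but the access
// is bracketed by two warp-level barriers.
__device__ int load_wrapped(const int *p)
{
    __syncwarp();   // barrier before the access
    int v = *p;
    __syncwarp();   // barrier after the access
    return v;
}

// Hypothetical test kernel: each lane records what it observed.
__global__ void demo(const int *in, int *out_a, int *out_b)
{
    int lane = threadIdx.x & 31;
    out_a[lane] = load_lane0_broadcast(in);
    out_b[lane] = load_wrapped(in);
}

// Host-side check: launches one warp and verifies that all lanes saw
// the same value with both variants. Returns false on any CUDA error.
bool run_demo()
{
    const int n = 32;
    int h_in = 42, h_a[n] = {0}, h_b[n] = {0};
    int *d_in = nullptr, *d_a = nullptr, *d_b = nullptr;

    if (cudaMalloc(&d_in, sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_a, n * sizeof(int)) != cudaSuccess) return false;
    if (cudaMalloc(&d_b, n * sizeof(int)) != cudaSuccess) return false;
    cudaMemcpy(d_in, &h_in, sizeof(int), cudaMemcpyHostToDevice);

    demo<<<1, n>>>(d_in, d_a, d_b);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(h_a, d_a, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, d_b, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        ok = ok && h_a[i] == h_in && h_b[i] == h_in;

    cudaFree(d_in);
    cudaFree(d_a);
    cudaFree(d_b);
    return ok;
}
```

A store would follow the same pattern: for option 1, guard the st with the lane-0 predicate (no propagation needed); for option 2, place the st between the two __syncwarp() calls. Option 1 issues one memory access per warp but needs a shuffle for loads; option 2 keeps the access in every lane at the cost of two barriers per ld/st.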