https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99932
--- Comment #14 from Tom de Vries <vries at gcc dot gnu.org> ---
An observation from playing around with vector-length-128-4.c: there are two
ways in which I can make the example pass:
1. add barrier.sync.aligned 0 or membar.cta after the first broadcast receive
2. unroll the loop in the first broadcast send.

At first glance, though, neither looks entirely trivial to implement.