https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85381
--- Comment #6 from Tom de Vries <vries at gcc dot gnu.org> ---
Created attachment 43992
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43992&action=edit
tentative patch

(In reply to Tom de Vries from comment #4)
> This looks like a JIT bug, but with this tentative patch:
> ...
> diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c
> index 8c478c874bd..ac394ee1ae6 100644
> --- a/gcc/config/nvptx/nvptx.c
> +++ b/gcc/config/nvptx/nvptx.c
> @@ -4479,7 +4479,7 @@ nvptx_process_pars (parallel *par)
>        threads = nvptx_mach_vector_length ();
>      }
>
> -  if (!empty || !is_call)
> +  if (!(empty || is_call))
>      {
>        /* Insert begin and end synchronizations.  */
>        emit_insn_before (nvptx_cta_sync (barrier, threads),
> ...
> no barriers are generated, and the minimized testcase passes.

Actually, this was a bit more complicated than that.

The original condition correctly identifies the situation of a call and no
state propagation as not needing barriers.

But in the case of not a call (so, fork/join), even if there's no state
propagation, we need synchronization at the end of worker and vector loops.

[ More precisely, we need it in between loops, but we're currently not
detecting that situation. And even more precisely, we need it in between
dependent loops, but we're also currently not detecting that situation. ]

This patch skips the first of the bar.syncs, keeping the one after the loop,
and that allows parallel-loop-1.c to pass with vector length 128.
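
To make the difference between the two conditions concrete, here is a minimal standalone sketch (not code from nvptx.c) that tabulates both. It assumes `empty` stands for "no state propagation" and `is_call` for "the parallel is a call rather than fork/join", as the discussion above suggests; the function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Condition before the tentative patch: barriers are emitted unless the
   parallel is both empty (assumed: no state propagation) and a call.  */
bool
barriers_original (bool empty, bool is_call)
{
  return !empty || !is_call;
}

/* Condition in the tentative patch from comment #4: barriers are emitted
   only when the parallel is neither empty nor a call.  */
bool
barriers_tentative (bool empty, bool is_call)
{
  return !(empty || is_call);
}

/* Print the truth table for both conditions and return the number of
   input combinations on which they disagree.  */
int
compare_barrier_conditions (void)
{
  int differing = 0;

  for (int e = 0; e <= 1; e++)
    for (int c = 0; c <= 1; c++)
      {
        bool orig = barriers_original (e, c);
        bool tent = barriers_tentative (e, c);

        /* De Morgan: !(empty || is_call) == !empty && !is_call.  */
        assert (tent == (!e && !c));

        printf ("empty=%d is_call=%d original=%d tentative=%d\n",
                e, c, orig, tent);
        if (orig != tent)
          differing++;
      }

  return differing;
}
```

Under these assumptions, the two conditions agree when both flags are set or both are clear. They disagree in the mixed cases, including empty fork/join (`empty` set, `is_call` clear), where the tentative condition drops the barriers that the comment above says are still needed at the end of worker and vector loops.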