https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99932
--- Comment #13 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Tom de Vries from comment #10)
> [ FTR, T400, driver 470.94 ]
>
> Interestingly, changing the default ptx version to 6.3 makes the minimal
> test-case pass, as well as the full parallel-dims.c
>
> The only code changes are shfl -> shfl.sync and vote -> vote.sync.
>

It seems another change is required.

Starting with 6.0, bar.sync maps onto barrier.sync.aligned, where the aligned
means that "all threads in CTA will execute the same barrier instruction. In
conditionally executed code, an aligned barrier instruction should only be
used if it is known that all threads in CTA evaluate the condition
identically, otherwise behavior is undefined."

It's not fully clear what is meant by "the same barrier instruction" or
"condition", but in the case of vector_length > 32, we use:
...
        bar.sync        %r67,64;
...
where %r67 is a barrier number, 1 for worker 0 and 2 for worker 1 in case of
2 workers.

It may well be that it's invalid to use bar.sync for this, and we should use
barrier.sync instead.

But then there's an isa note:
...
Note: For .target sm_6x or below,
1. barrier instruction without .aligned modifier is equivalent to .aligned
   variant and has the same restrictions as of .aligned variant.
...
which seems to imply that we get back barrier.sync.aligned behaviour for sm_6x
and earlier, which would again break vector_length > 32.
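
For reference, the per-worker barrier pattern can be reproduced outside the compiler with a small CUDA sketch. This is not the code GCC generates. The names named_barrier, worker_sum and run_demo are made up for illustration, and the sizes (vector_length 64, 2 workers) are assumed from the description above. Whether this use of bar.sync is actually valid under the .aligned rules is the open question of this comment, so a passing run on one device proves nothing about the general case.

```cuda
#include <cuda_runtime.h>

// Assumed sizes, taken from the description: vector_length 64, 2 workers.
constexpr int kVectorLength = 64;
constexpr int kWorkers = 2;

// Hypothetical helper: synchronize `nthreads` threads on named barrier `id`,
// using the two-operand form "bar.sync a, b" quoted in the comment.
__device__ void named_barrier(unsigned id, unsigned nthreads)
{
  asm volatile("bar.sync %0, %1;" :: "r"(id), "r"(nthreads) : "memory");
}

__global__ void worker_sum(int *out)
{
  __shared__ int buf[kWorkers][kVectorLength];
  const unsigned worker = threadIdx.y;  // one row of threads per worker
  const unsigned lane = threadIdx.x;    // position within the vector

  buf[worker][lane] = lane + 1;

  // Barrier number 1 for worker 0, 2 for worker 1; each barrier expects
  // exactly the 64 threads (two warps) that belong to that worker.
  named_barrier(worker + 1, kVectorLength);

  // After the barrier, lane 0 reads what the other lanes of its worker wrote.
  if (lane == 0) {
    int sum = 0;
    for (int i = 0; i < kVectorLength; ++i)
      sum += buf[worker][i];
    out[worker] = sum;
  }
}

// Returns 0 if every worker observed all writes of its own lanes,
// 1 on a wrong sum, -1 on a CUDA runtime error.
int run_demo()
{
  int *d_out = nullptr;
  int h_out[kWorkers] = {0, 0};
  if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess)
    return -1;
  worker_sum<<<1, dim3(kVectorLength, kWorkers)>>>(d_out);
  cudaError_t err =
      cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
  cudaFree(d_out);
  if (err != cudaSuccess)
    return -1;
  for (int w = 0; w < kWorkers; ++w)
    if (h_out[w] != kVectorLength * (kVectorLength + 1) / 2)
      return 1;
  return 0;
}
```

Calling run_demo() from a main and building with nvcc for different -arch values gives a quick way to compare behaviour across targets. All threads of a warp belong to the same worker here, so they all pass the same barrier number; only different warps use different numbers.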