https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104893
--- Comment #1 from Tom de Vries <vries at gcc dot gnu.org> ---
(In reply to Tom de Vries from comment #0)
> The per-thread call stack is handled for .local memory by the CUDA driver.
>
> For the 'soft stack' that's not the case.

Hmm, actually there's .local memory used, just not "directly". Possibly the
documentation needs updating to point that out.

So, there doesn't seem to be an issue related to overlapping storage.

So I wonder, is the stack pointer also per thread then? Or still per-warp?