https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97384
Bug ID: 97384
Summary: [libgomp, nvptx] Handle -msoft-stack-reserve-local=<n> overflow in plugin
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: libgomp
Assignee: unassigned at gcc dot gnu.org
Reporter: vries at gcc dot gnu.org
CC: jakub at gcc dot gnu.org
Target Milestone: ---

Using the option -msoft-stack-reserve-local=<n> results in a:
...
.local .align 8 .b8 %simtstack_ar[n+8];
...

However, CU_LIMIT_STACK_SIZE is set by default to 1 KB for my card/driver combo, so if I specify say -msoft-stack-reserve-local=2048, I run into:
...
libgomp: cuCtxSynchronize error: an illegal memory access was encountered
...
or:
...
libgomp: cuCtxSynchronize error: an illegal instruction was encountered
...
[ The latter at GOMP_NVPTX_JIT=-O0. ]

These errors may look a lot like the behaviour we're trying to fix by adding -msoft-stack-reserve-local.

There's currently no way to make this work.

We could add an env var, say GOMP_NVPTX_LIMIT_STACK_SIZE, which is used to set:
...
r = cuCtxSetLimit(CU_LIMIT_STACK_SIZE, gomp_nvptx_limit_stack_size);
...
and then do:
...
$ GOMP_NVPTX_LIMIT_STACK_SIZE=3072 ./a.out
...
[ Note that the GOMP_NVPTX_LIMIT_STACK_SIZE value is chosen to be larger than 2048 to accommodate other .local usage. ]

[ It would be nice if we could attempt to accommodate the requested stack size in the libgomp plugin automatically. In the current setup, that would mean scanning the ptx code for "simtstack_ar[<n>]", which is a bit cumbersome and probably too slow. Perhaps emitting an additional line before the preamble like this:
...
// SIMTSTACK_AR_SIZE: 2048
...
would be possible to handle quickly enough. ]
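
For illustration only, here is a rough sketch of how the plugin side of both ideas might look: an explicit GOMP_NVPTX_LIMIT_STACK_SIZE value wins, otherwise the size is taken from a "// SIMTSTACK_AR_SIZE:" line found among the leading comment lines of the ptx text. None of this exists in libgomp. The helper names, the marker handling and the 1024-byte headroom (picked to mirror the 3072 vs. 2048 example above) are all assumptions.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical marker line; nothing emits this today.  */
#define SIMTSTACK_MARKER "// SIMTSTACK_AR_SIZE:"

/* Assumed headroom for other .local usage, mirroring 3072 - 2048 above.  */
#define EXTRA_LOCAL_HEADROOM 1024

/* Parse a non-negative decimal byte count that ends at a newline or at the
   end of the string.  Return 0 on success, -1 on malformed input.  */
static int
parse_size (const char *s, size_t *out)
{
  char *end;
  unsigned long long v;

  if (s == NULL)
    return -1;
  while (*s == ' ' || *s == '\t')
    s++;
  if (*s < '0' || *s > '9')
    return -1;  /* Also rejects signs, which strtoull would accept.  */
  errno = 0;
  v = strtoull (s, &end, 10);
  if (errno != 0 || v > SIZE_MAX)
    return -1;
  while (*end == ' ' || *end == '\t' || *end == '\r')
    end++;
  if (*end != '\0' && *end != '\n')
    return -1;
  *out = (size_t) v;
  return 0;
}

/* Look for the marker among the leading "//" comment lines only, so the
   scan stops at the first non-comment line instead of walking all the ptx.  */
static int
ptx_simtstack_size (const char *ptx, size_t *out)
{
  const size_t marker_len = strlen (SIMTSTACK_MARKER);
  const char *line = ptx;

  while (line[0] == '/' && line[1] == '/')
    {
      if (strncmp (line, SIMTSTACK_MARKER, marker_len) == 0)
        return parse_size (line + marker_len, out);
      line = strchr (line, '\n');
      if (line == NULL)
        break;
      line++;
    }
  return -1;
}

/* Return the stack limit to request, or 0 to keep the driver default.
   ENV_VALUE is the raw env var string (may be NULL), PTX the ptx text
   (may be NULL).  */
size_t
choose_stack_limit (const char *env_value, const char *ptx)
{
  size_t n;

  if (parse_size (env_value, &n) == 0)
    return n;
  if (ptx != NULL && ptx_simtstack_size (ptx, &n) == 0
      && n <= SIZE_MAX - EXTRA_LOCAL_HEADROOM)
    return n + EXTRA_LOCAL_HEADROOM;
  return 0;
}
```

The plugin would then presumably call choose_stack_limit (getenv ("GOMP_NVPTX_LIMIT_STACK_SIZE"), ptx_text) once the context exists, and pass any non-zero result to cuCtxSetLimit (CU_LIMIT_STACK_SIZE, limit), reporting an error if that call fails. Whether a fixed headroom is adequate for other .local usage is an open question.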