https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84871
Bug ID: 84871
Summary: libgomp examples-4/declare_target-[12].f90 fail with
nvptx Titan V offloading
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Severity: minor
Priority: P3
Component: libgomp
Assignee: unassigned at gcc dot gnu.org
Reporter: cesar at gcc dot gnu.org
CC: jakub at gcc dot gnu.org
Target Milestone: ---
Both libgomp.fortran/examples-4/declare_target-1.f90 and
libgomp.fortran/examples-4/declare_target-2.f90 fail when offloaded on Nvidia
Titan V (or Volta family) GPUs running Nvidia driver 390.25. The failure appears
to be the result of a limited per-CUDA-thread stack size of 1024 bytes, as
reported by cuCtxGetLimit (..., CU_LIMIT_STACK_SIZE).
Those tests only fail at -O1, -O2 and -Os. Furthermore, all of the tests pass
on older Nvidia GPUs, including Kepler (K80s) and Pascal (GeForce 1080).
One thing I noticed was that ptxas reports that it is spilling more registers
to the stack for the Volta GPUs than it is for Pascal GPUs. Here are the
relevant statistics for Pascal:
ptxas info : Function properties for __e_53_1_mod_MOD_fib
24 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads
Here are the corresponding statistics for Volta:
ptxas info : Function properties for __e_53_1_mod_MOD_fib
40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads
Given that we can't control the PTX driver JIT, maybe we should either reduce
the recursion depth in declare_target-[12].f90 to 20 (actually fib (22) works,
but I don't want a newer driver to break it again), or just xfail those tests
for nvptx targets.
The CUDA driver API provides the cuCtxSetLimit function to adjust the stack
limit, but apparently that only adjusts the upper bound.