I've encountered that many times, and for me, it was always related to 
AutoDetect and the nvidia-ml library.  Does your slurmd log contain a line like 
"debug:  skipping GRES for NodeName=t-gc-1202  AutoDetect=nvml"?  I see that 
you didn't specifically set AutoDetect to nvml in gres.conf, but maybe you 
should set AutoDetect=off just to be sure.
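
For what it's worth, here is roughly what I mean in gres.conf. This is only a sketch: the node name is taken from the log line above, while the GPU line (Name=gpu, File=/dev/nvidia0) is a placeholder I made up, so keep whatever device entries you already have.

```
# gres.conf (sketch; the GPU line below is a placeholder, not your real config)
AutoDetect=off
NodeName=t-gc-1202 Name=gpu File=/dev/nvidia0
```

With AutoDetect off, slurmd should rely only on the entries you list explicitly instead of querying NVML.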

If "sinfo" shows the node as "inval", then setting it to Resume (not Idle) won't 
work until you figure out why Slurm thinks the node configuration is invalid.
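
These are the commands I would probably reach for to find the reason, again using the node name from the log line above. Treat it as a rough checklist rather than a definitive procedure; the output will depend on your setup.

```
sinfo -R                       # list nodes with the reason Slurm recorded for their state
scontrol show node t-gc-1202   # check the State= and Reason= fields
slurmd -C                      # run on the node itself: prints the hardware slurmd detects
# once slurm.conf and gres.conf agree with what the node reports:
scontrol update NodeName=t-gc-1202 State=RESUME
```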
