Hi,
I just upgraded my cluster from 19.05 to 20.02.
Now, in the prolog/epilog scripts, the variables SLURM_JOB_GPUS,
CUDA_VISIBLE_DEVICES and GPU_DEVICE_ORDINAL are missing.
I am setting access to the GPUs via cgroups.
The only variables available in the prolog are:
SLURMD_NODENAME
SLURM_CLUSTER_
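
To confirm exactly which variables the prolog receives, one option is to dump the relevant part of the environment from inside the prolog. The sketch below is only an illustration, not something from my actual setup: the helper names `report_gpu_env` and `slurm_related` are made up, and it assumes the prolog is allowed to call a Python script and that its output is redirected somewhere readable.

```python
import os

# The three variables reported as missing after the upgrade.
EXPECTED = ("SLURM_JOB_GPUS", "CUDA_VISIBLE_DEVICES", "GPU_DEVICE_ORDINAL")


def report_gpu_env(env):
    """Split the expected GPU variables into those present and those missing."""
    present = {name: env[name] for name in EXPECTED if name in env}
    missing = [name for name in EXPECTED if name not in env]
    return present, missing


def slurm_related(env):
    """Return every SLURM*/CUDA_*/GPU_* variable, sorted by name."""
    prefixes = ("SLURM", "CUDA_", "GPU_")
    return {k: v for k, v in sorted(env.items()) if k.startswith(prefixes)}


if __name__ == "__main__":
    # Print what the script actually sees; redirect as needed (assumption:
    # the prolog's output is captured or sent to a log file by the caller).
    for name, value in slurm_related(os.environ).items():
        print(f"{name}={value}")
    _, missing = report_gpu_env(os.environ)
    for name in missing:
        print(f"MISSING {name}")
```

Running the same script from the prolog on both versions would show whether the variables are really absent or only unset for certain jobs.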
I posted this yesterday and this does appear to be related to a specific
job. Note this error: "gres/gpu: count changed for node node002 from 0 to
1" Could it be misleading? What could cause the node to drain? Here are the
contents of the user's SBATCH file. Could the piping be having an effect here?