I have a large compute node with 10 RTX8000 cards at a remote colo. One of the cards on it is acting up, "falling off the bus" once a day and requiring a full power cycle to reset.
I want jobs to avoid that card as well as the card it is NVLINK'ed to. So I modified gres.conf on that node as follows:

    # cat /etc/slurm/gres.conf
    AutoDetect=nvml
    Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia0
    Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia1
    #Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia2
    Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia3
    #Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia4
    Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia5
    Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia6
    Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia7
    Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia8
    Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia9

and in slurm.conf I changed the node definition from Gres=gpu:quadro_rtx_8000:10 to Gres=gpu:quadro_rtx_8000:8. I restarted slurmctld and slurmd after this.

I then put the node back from drain to idle. Jobs were submitted and started on the node, but they are using the GPUs I told it to avoid:

    +--------------------------------------------------------------------+
    | Processes:                                                         |
    |  GPU   GI   CI        PID   Type   Process name         GPU Memory |
    |        ID   ID                                          Usage      |
    |====================================================================|
    |    0   N/A  N/A     63426      C   python                 11293MiB |
    |    1   N/A  N/A     63425      C   python                 11293MiB |
    |    2   N/A  N/A     63425      C   python                 10869MiB |
    |    2   N/A  N/A     63426      C   python                 10869MiB |
    |    4   N/A  N/A     63425      C   python                 10849MiB |
    |    4   N/A  N/A     63426      C   python                 10849MiB |
    +--------------------------------------------------------------------+

How can I make SLURM not use GPUs 2 and 4?

---------------------------------------------------------------
Paul Raines                     http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street     Charlestown, MA 02129            USA


You can use the nvidia-smi command to 'drain' the GPUs, which will power down the GPUs so that no applications will use them.

This answer on Unix & Linux Stack Exchange explains how to do that:

https://unix.stackexchange.com/a/654089/94412

You can create a script to run at boot and 'drain' the cards.

Regards --Mick
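
For illustration, the boot-time drain script suggested above could look roughly like the sketch below. It is only a sketch under stated assumptions: the two PCI bus IDs are placeholders (not taken from this node), the BAD_GPUS and DRY_RUN variables and the run/drain_gpu helpers are hypothetical names, and turning persistence mode off before draining is an assumption about what the driver expects. The real bus IDs can be listed with "nvidia-smi --query-gpu=index,pci.bus_id --format=csv"; addressing the cards by bus ID avoids relying on the nvidia-smi index matching the /dev/nvidiaN number.

```shell
#!/bin/sh
# Sketch: take selected GPUs out of service with `nvidia-smi drain`.
# ASSUMPTION: these PCI bus IDs are placeholders; replace them with the
# bus IDs of the faulty card and its NVLink partner.
BAD_GPUS="0000:3B:00.0 0000:5E:00.0"

# By default commands are only printed. Set DRY_RUN=0 to execute them.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Drain one GPU, identified by its PCI bus ID.
drain_gpu() {
    # ASSUMPTION: persistence mode should be off before draining.
    run nvidia-smi -i "$1" -pm 0
    # -m 1 puts the device into drain state; -m 0 would take it out again.
    run nvidia-smi drain -p "$1" -m 1
}

for bus_id in $BAD_GPUS; do
    drain_gpu "$bus_id"
done
```

Run as-is, the script only prints the nvidia-smi commands it would issue, which makes it easy to check the bus IDs first. With DRY_RUN=0 it executes them (this typically needs root), and it could then be hooked into whatever boot mechanism the node uses, ahead of slurmd starting.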