Not a solution, but some ideas & experiences concerning the same topic:
A few of our older GPUs used to show the error message "has fallen off the bus", which could only be resolved by a full power cycle as well.
Something changed: nowadays the error message is "GPU lost" and a normal reboot resolves the problem. This might be a result of an update of the Nvidia drivers (currently 60.73.01), but I can't be sure.
The current behaviour allowed us to write a script that checks the GPU state every 10 minutes and sets a node to a drain & reboot state when such a "lost" GPU is detected (a rough sketch of the check follows below).
This has been working well for a couple of months now and saves us time.

It might help as well to re-seat all GPUs and PCI risers; this also seemed to help in one of our GPU nodes. Again, I can't be sure, we'd need to try this with other - still failing - GPUs.
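For illustration, a minimal sketch of the kind of periodic check described above; the expected GPU count, node handling and drain reason are assumptions, not our exact production script:

  #!/bin/bash
  # Sketch: drain this node if fewer GPUs than expected are visible.
  # EXPECTED is an assumed per-node value; adjust to the node type.
  EXPECTED=4
  NODE=$(hostname -s)

  # nvidia-smi -L prints one "GPU n: ..." line per GPU the driver can see;
  # a "lost" GPU disappears from the list (or the command fails entirely).
  VISIBLE=$(nvidia-smi -L 2>/dev/null | grep -c '^GPU ')

  if [ "$VISIBLE" -lt "$EXPECTED" ]; then
      # Drain the node so no new jobs start; the reboot itself can be
      # triggered separately (e.g. with "scontrol reboot" if a
      # RebootProgram is configured).
      scontrol update NodeName="$NODE" State=DRAIN \
          Reason="lost GPU: only $VISIBLE of $EXPECTED visible"
  fi

Run from cron, e.g. an /etc/cron.d entry like "*/10 * * * * root /usr/local/sbin/check_gpus.sh" (the path is hypothetical).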
The problem is to identify the cards physically from the information we have, like what's reported by nvidia-smi or available in /proc/driver/nvidia/gpus/*/information. The serial number isn't shown for every type of GPU, and I'm not sure the ones shown match the stickers on the GPUs. If anybody knows of a practical solution for this, I'd be happy to read it.
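For what it's worth, the closest we get is mapping the driver index to a PCI bus ID plus the serial number where the board exposes one; this is just a generic nvidia-smi query, not a guaranteed way to find the physical slot:

  # One line per GPU: index, PCI bus id, serial (may be "[N/A]"), UUID, name
  nvidia-smi --query-gpu=index,pci.bus_id,serial,uuid,name --format=csv

On some servers "dmidecode -t slot" can then relate the PCI bus address to a physical slot label, but how reliable that is seems to vary by vendor.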
Eventually I'd like to pull out all cards which repeatedly get "lost" and maybe move them all to a node for short debug jobs or throw them away (they're all beyond warranty anyway).
Stephan

On 31.01.22 15:45, Timony, Mick wrote:
I have a large compute node with 10 RTX8000 cards at a remote colo. One of the cards on it is acting up, "falling off the bus" once a day, requiring a full power cycle to reset.

I want jobs to avoid that card as well as the card it is NVLINK'ed to. So I modified gres.conf on that node as follows:

# cat /etc/slurm/gres.conf
AutoDetect=nvml
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia0
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia1
#Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia2
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia3
#Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia4
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia5
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia6
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia7
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia8
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia9

and in slurm.conf I changed the node definition from Gres=gpu:quadro_rtx_8000:10 to Gres=gpu:quadro_rtx_8000:8. I restarted slurmctld and slurmd after this. I then put the node back from drain to idle.

Jobs were submitted and started on the node, but they are using the GPUs I told it to avoid:

+-----------------------------------------------------------------------+
| Processes:                                                             |
|  GPU   GI   CI        PID   Type   Process name            GPU Memory |
|        ID   ID                                              Usage     |
|=======================================================================|
|    0   N/A  N/A     63426      C   python                   11293MiB  |
|    1   N/A  N/A     63425      C   python                   11293MiB  |
|    2   N/A  N/A     63425      C   python                   10869MiB  |
|    2   N/A  N/A     63426      C   python                   10869MiB  |
|    4   N/A  N/A     63425      C   python                   10849MiB  |
|    4   N/A  N/A     63426      C   python                   10849MiB  |
+-----------------------------------------------------------------------+

How can I make SLURM not use GPU 2 and 4?

---------------------------------------------------------------
Paul Raines                     http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street     Charlestown, MA 02129            USA

You can use the nvidia-smi command to 'drain' the GPUs, which will power them down so that no applications will use them. This thread on Stack Exchange explains how to do that:

https://unix.stackexchange.com/a/654089/94412

You can create a script to run at boot and 'drain' the cards.

Regards
--Mick
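For reference, if I read the linked answer correctly, the drain approach looks roughly like this (the PCI bus ID below is only an example; check the answer and "nvidia-smi drain --help" for the exact flags on your driver version):

  # Look up the PCI bus ID of the failing card, e.g. GPU 2
  nvidia-smi --query-gpu=index,pci.bus_id --format=csv

  # Mark that GPU as drained so no new compute processes can use it
  sudo nvidia-smi drain -p 0000:3B:00.0 -m 1

  # Re-enable it later
  sudo nvidia-smi drain -p 0000:3B:00.0 -m 0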