We are running Slurm 20.02 on CentOS 7.7 with Bright Cluster 8.2. I'm wondering how the sbatch file below ends up sharing a GPU.
MPS is running on the head node:

    # ps -auwx | grep mps
    root 108581 0.0 0.0 12780 812 ? Ssl Mar23 0:27 /cm/local/apps/cuda-driver/libs/440.33.01/bin/nvidia-cuda-mps-control -d

The entire script is posted on SO here:
<https://stackoverflow.com/questions/34709749/how-do-i-use-nvidia-multi-process-service-mps-to-run-multiple-non-mpi-cuda-app>

Here are the sbatch file contents:

    #!/bin/sh
    #SBATCH -N 1
    #SBATCH -n 1
    #SBATCH --job-name=sequentialBlur_alexnet_training_imagewoof_crossval
    #SBATCH --nodelist=node003
    module purge
    module load gcc5 cuda10.1
    module load openmpi/cuda/64
    module load pytorch-py36-cuda10.1-gcc
    module load ml-pythondeps-py36-cuda10.1-gcc
    python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof $1 | tee alex_100_imwoof_seq_longtrain_cv_$1.txt

From nvidia-smi on the compute node:

    Processes
        Process ID          : 320467
            Type            : C
            Name            : python3.6
            Used GPU Memory : 2369 MiB
        Process ID          : 320574
            Type            : C
            Name            : python3.6
            Used GPU Memory : 2369 MiB

    [node003 ~]# nvidia-smi -q -d compute

    ==============NVSMI LOG==============

    Timestamp                 : Fri Apr 3 15:27:49 2020
    Driver Version            : 440.33.01
    CUDA Version              : 10.2

    Attached GPUs             : 1
    GPU 00000000:3B:00.0
        Compute Mode          : Default

    [~]# nvidia-smi
    Fri Apr 3 15:28:49 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
    | N/A   42C    P0    46W / 250W |   4750MiB / 32510MiB |     32%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0    320467      C   python3.6                                   2369MiB |
    |    0    320574      C   python3.6                                   2369MiB |
    +-----------------------------------------------------------------------------+

From htop:

    320574 ouruser  20  0 12.2G 1538M 412M R 502.  0.8 14h45:59 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
    320467 ouruser  20  0 12.2G 1555M 412M D 390.  0.8 14h45:13 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
    320654 ouruser  20  0 12.2G 1555M 412M R 111.  0.8  3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
    320656 ouruser  20  0 12.2G 1555M 412M R 111.  0.8  3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
    320658 ouruser  20  0 12.2G 1538M 412M R 111.  0.8  3h00:54 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
    320660 ouruser  20  0 12.2G 1538M 412M R 111.  0.8  3h00:53 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
    320661 ouruser  20  0 12.2G 1538M 412M R 111.  0.8  3h00:54 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
    320655 ouruser  20  0 12.2G 1555M 412M R  55.8 0.8  3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
    320657 ouruser  20  0 12.2G 1555M 412M R  55.8 0.8  3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
    320659 ouruser  20  0 12.2G 1538M 412M R  55.8 0.8  3h00:53 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1

Is PyTorch somehow working around Slurm and NOT locking a GPU, since the user omitted --gres=gpu:1?
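For comparison, this is roughly how I understand the job would have to request the GPU through Slurm's GRES mechanism for Slurm to track and fence it off (a sketch only; it assumes GresTypes=gpu is enabled and node003 has a matching gres.conf entry, which I have not shown here):

    #!/bin/sh
    #SBATCH -N 1
    #SBATCH -n 1
    #SBATCH --gres=gpu:1           # ask Slurm to allocate one GPU to this job
    #SBATCH --job-name=sequentialBlur_alexnet_training_imagewoof_crossval
    #SBATCH --nodelist=node003

    # With --gres=gpu:1 Slurm sets CUDA_VISIBLE_DEVICES to the allocated device,
    # so a second GPU job should queue rather than land on the same card.
    echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
    python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof $1 | tee alex_100_imwoof_seq_longtrain_cv_$1.txt

My understanding is that without --gres=gpu:1, Slurm never sets CUDA_VISIBLE_DEVICES, so PyTorch simply sees the whole GPU and both job instances pile onto it.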
How can I tell if MPS is really working?
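For reference, this is the kind of check I had in mind on the compute node (a sketch, assuming the default MPS pipe directory; I'm not sure whether the daemon running on the head node is even relevant to jobs on node003):

    # Ask the MPS control daemon for active server instances and their clients.
    # If nothing comes back, no process on this node is going through MPS.
    echo get_server_list | nvidia-cuda-mps-control

    # For each server PID reported above (placeholder PID shown here):
    echo "get_client_list <server_pid>" | nvidia-cuda-mps-control

    # Another sign: when clients attach through MPS, nvidia-smi should show an
    # nvidia-cuda-mps-server process on the GPU rather than (or alongside) the
    # individual python3.6 processes.
    nvidia-smi | grep -i mps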