We are running Slurm 20.02 on CentOS 7.7 with Bright Cluster 8.2. I'm wondering how the sbatch file below ends up sharing a GPU.
MPS is running on the head node:

    # ps -auwx | grep mps
    root 108581 0.0 0.0 12780 812 ? Ssl Mar23 0:27 /cm/local/apps/cuda-driver/libs/440.33.01/bin/nvidia-cuda-mps-control -d

The entire script is posted on SO here:
<https://stackoverflow.com/questions/34709749/how-do-i-use-nvidia-multi-process-service-mps-to-run-multiple-non-mpi-cuda-app>

Here are the sbatch file contents:

    #!/bin/sh
    #SBATCH -N 1
    #SBATCH -n 1
    #SBATCH --job-name=sequentialBlur_alexnet_training_imagewoof_crossval
    #SBATCH --nodelist=node003
    module purge
    module load gcc5 cuda10.1
    module load openmpi/cuda/64
    module load pytorch-py36-cuda10.1-gcc
    module load ml-pythondeps-py36-cuda10.1-gcc
    python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof $1 | tee alex_100_imwoof_seq_longtrain_cv_$1.txt

From nvidia-smi on the compute node:

    Processes
        Process ID          : 320467
            Type            : C
            Name            : python3.6
            Used GPU Memory : 2369 MiB
        Process ID          : 320574
            Type            : C
            Name            : python3.6
            Used GPU Memory : 2369 MiB

    [node003 ~]# nvidia-smi -q -d compute

    ==============NVSMI LOG==============

    Timestamp                 : Fri Apr 3 15:27:49 2020
    Driver Version            : 440.33.01
    CUDA Version              : 10.2

    Attached GPUs             : 1
    GPU 00000000:3B:00.0
        Compute Mode          : Default

    [~]# nvidia-smi
    Fri Apr 3 15:28:49 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
    | N/A   42C    P0    46W / 250W |   4750MiB / 32510MiB |     32%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |    0    320467      C   python3.6                                   2369MiB |
    |    0    320574      C   python3.6                                   2369MiB |
    +-----------------------------------------------------------------------------+

From htop:

    320574 ouruser  20  0 12.2G 1538M 412M R 502.  0.8 14h45:59 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
    320467 ouruser  20  0 12.2G 1555M 412M D 390.  0.8 14h45:13 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
    320654 ouruser  20  0 12.2G 1555M 412M R 111.  0.8  3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
    320656 ouruser  20  0 12.2G 1555M 412M R 111.  0.8  3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
    320658 ouruser  20  0 12.2G 1538M 412M R 111.  0.8  3h00:54 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
    320660 ouruser  20  0 12.2G 1538M 412M R 111.  0.8  3h00:53 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
    320661 ouruser  20  0 12.2G 1538M 412M R 111.  0.8  3h00:54 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
    320655 ouruser  20  0 12.2G 1555M 412M R  55.8 0.8  3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
    320657 ouruser  20  0 12.2G 1555M 412M R  55.8 0.8  3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
    320659 ouruser  20  0 12.2G 1538M 412M R  55.8 0.8  3h00:53 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1

Is PyTorch somehow working around Slurm and NOT locking a GPU, since the user omitted --gres=gpu:1?
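For comparison, this is roughly how I understand the job would have to request the GPU through Slurm's GRES mechanism for Slurm to track and fence it off (a sketch only; it assumes GresTypes=gpu is enabled and node003 has a matching gres.conf entry, which I have not shown here):

    #!/bin/sh
    #SBATCH -N 1
    #SBATCH -n 1
    #SBATCH --gres=gpu:1           # ask Slurm to allocate one GPU to this job
    #SBATCH --job-name=sequentialBlur_alexnet_training_imagewoof_crossval
    #SBATCH --nodelist=node003

    # With --gres=gpu:1 Slurm sets CUDA_VISIBLE_DEVICES to the allocated device,
    # so a second GPU job should queue rather than land on the same card.
    echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
    python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof $1 | tee alex_100_imwoof_seq_longtrain_cv_$1.txt

My understanding is that without --gres=gpu:1, Slurm never sets CUDA_VISIBLE_DEVICES, so PyTorch simply sees the whole GPU and both job instances pile onto it.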
How can I tell if MPS is really working?
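For reference, this is the kind of check I had in mind on the compute node (a sketch, assuming the default MPS pipe directory; I'm not sure whether the daemon running on the head node is even relevant to jobs on node003):

    # Ask the MPS control daemon for active server instances and their clients.
    # If nothing comes back, no process on this node is going through MPS.
    echo get_server_list | nvidia-cuda-mps-control

    # For each server PID reported above (placeholder PID shown here):
    echo "get_client_list <server_pid>" | nvidia-cuda-mps-control

    # Another sign: when clients attach through MPS, nvidia-smi should show an
    # nvidia-cuda-mps-server process on the GPU rather than (or alongside) the
    # individual python3.6 processes.
    nvidia-smi | grep -i mps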