I'm using this TensorRT tutorial <https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/sampleMovieLensMPS> with MPS on Slurm 20.02 on Bright Cluster 8.2.
I’m trying to use srun to test this, but it always fails because it appears to be trying all nodes. We only have 3 compute nodes. As I’m writing this, node002 and node003 are in use by other users, so I just want to use node001.

srun /home/mydir/mpsmovietest --gres=gpu:1 --job-name=MPSMovieTest --nodes=1 --nodelist=node001 -Z --output=mpstest.out

Tue Apr 14 16:45:10 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   67C    P0   241W / 250W |  32167MiB / 32510MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    428996      C   python3.6                                  32151MiB |
+-----------------------------------------------------------------------------+
Loading openmpi/cuda/64/3.1.4
  Loading requirement: hpcx/2.4.0 gcc5/5.5.0
Loading cm-ml-python3deps/3.2.3
  Loading requirement: python36
Loading tensorflow-py36-cuda10.1-gcc/1.15.2
  Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20 keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0 nccl2-cuda10.1-gcc/2.5.6
&&&& RUNNING TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
[03/14/2020-16:45:10] [I] ../../../data/movielens/movielens_ratings.txt
[E] [TRT] CUDA initialization failure with error 999.
Please check your CUDA installation: http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
[E] Could not create builder.
[03/14/2020-16:45:10] [03/14/2020-16:45:10]
&&&& FAILED TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
srun: error: node002: task 0: Exited with exit code 1

So is my syntax wrong with srun? MPS is running:

$ ps -auwx|grep mps
root 108581 0.0 0.0 12780 812 ? Ssl Mar23 0:54 /cm/local/apps/cuda-

When node002 is available the program runs correctly, albeit with an error on the log file:

srun /home/mydir/mpsmovietest --gres=gpu:1 --job-name=MPSMovieTest --nodes=1 --nodelist=node001 -Z --output=mpstest.out

Thu Apr 16 10:08:52 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   28C    P0    25W / 250W |     41MiB / 32510MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    420596      C   nvidia-cuda-mps-server                        29MiB |
+-----------------------------------------------------------------------------+
Warning: Failed writing log files to directory [/tmp/nvidia-log]. No logs will be available.
An instance of this daemon is already running
Warning: Failed writing log files to directory [/tmp/nvidia-log]. No logs will be available.
Loading openmpi/cuda/64/3.1.4
  Loading requirement: hpcx/2.4.0 gcc5/5.5.0
Loading cm-ml-python3deps/3.2.3
  Loading requirement: python36
Loading tensorflow-py36-cuda10.1-gcc/1.15.2
  Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20 keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0 nccl2-cuda10.1-gcc/2.5.6
&&&& RUNNING TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
[03/16/2020-10:08:52] [I] ../../../data/movielens/movielens_ratings.txt
[03/16/2020-10:08:53] [I] Begin parsing model...
[03/16/2020-10:08:53] [I] End parsing model...
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:08:57] [03/16/2020-10:08:57] [I] [TRT] Detected 2 inputs and 3 output network tensors.
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:08:57] [03/16/2020-10:08:57] [I] End building engine...
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:09:01] [03/16/2020-10:09:01] [03/16/2020-10:09:01] [I] Done execution in process: 99395 . Duration : 315.744 microseconds.
[03/16/2020-10:09:01] [I] Num of users : 2
[03/16/2020-10:09:01] [I] Num of Movies : 100
[03/16/2020-10:09:01] [I] | PID : 99395 | User : 0 | Expected Item : 128 | Predicted Item : 128 |
[03/16/2020-10:09:01] [I] | PID : 99395 | User : 1 | Expected Item : 133 | Predicted Item : 133 |
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:09:01] [03/16/2020-10:09:01] [03/16/2020-10:09:01] [I] Done execution in process: 99396 . Duration : 306.944 microseconds.
[03/16/2020-10:09:01] [I] Num of users : 2
[03/16/2020-10:09:01] [I] Num of Movies : 100
[03/16/2020-10:09:01] [I] | PID : 99396 | User : 0 | Expected Item : 128 | Predicted Item : 128 |
[03/16/2020-10:09:01] [I] | PID : 99396 | User : 1 | Expected Item : 133 | Predicted Item : 133 |
[03/16/2020-10:09:02] [I] Number of processes executed : 2. Total MPS Run Duration : 4361.73 milliseconds.
&&&& PASSED TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2

Here are the contents of the mpsmovietest sbatch file:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=MPSMovieTest
#SBATCH --gres=gpu:1
#SBATCH --nodelist=node001
#SBATCH --output=mpstest.out
export CUDA_VISIBLE_DEVICES=0
nvidia-smi -i 0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d
module load shared slurm openmpi/cuda/64 cm-ml-python3deps/3.2.3 cudnn/7.0 slurm cuda10.1/toolkit ml-pythondeps-py36-cuda10.1-gcc/3.2.3 tensorflow-py36-cuda10.1-gcc tensorrt-cuda10.1-gcc/6.0.1.5 gcc gdb keras-py36-cuda10.1-gcc nccl2-cuda10.1-gcc
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
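
One thing that may be worth checking, offered as an assumption rather than a confirmed diagnosis: srun stops parsing its own options at the first non-option argument, so in the commands above everything after /home/mydir/mpsmovietest would be handed to the script as arguments instead of being seen by srun. The #SBATCH lines are only read by sbatch, so they would not restrict an srun launch either. The sketch below only assembles and prints a reordered command line without submitting anything. The array name srun_cmd is made up for illustration, and -Z is left out on the assumption that it is not needed here (as far as I understand the srun man page, it is a privileged option).

```shell
#!/bin/bash
# Sketch only: builds the srun command line and prints it (dry run).
# Assumption: srun options must come BEFORE the executable; anything after
# the executable is passed to the executable as its own arguments.

script=/home/mydir/mpsmovietest   # path taken from the post

# Hypothetical helper array holding the reordered command.
srun_cmd=(
    srun
    --gres=gpu:1
    --job-name=MPSMovieTest
    --nodes=1
    --nodelist=node001
    --output=mpstest.out
    "$script"
)

# Print the command for inspection instead of running it.
echo "${srun_cmd[*]}"

# To actually launch it on the cluster, uncomment the next line:
# "${srun_cmd[@]}"
```

If the #SBATCH directives in the file are meant to apply, submitting it with `sbatch /home/mydir/mpsmovietest` should have the same effect, since sbatch reads those lines from the script.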