Compared with the Slurm MPS configuration example here <https://slurm.schedmd.com/gres.html#MPS_config_example_2>, our gres.conf has this:

NodeName=node[001-003] Name=mps Count=400
What does "Count" really mean and how do you use this number? >From that web page <https://slurm.schedmd.com/gres.html#MPS_Management> you have: "MPS configuration includes only the Name and Count parameters: The count of gres/mps elements will be evenly distributed across all GPUs configured on the node. This is similar to case 1, but places duplicate configuration in the gres.conf file." Also on that page there is this: # Example 1 of gres.conf # Configure support for four GPUs (with MPS) AutoDetect=nvml Name=gpu Type=gp100 File=/dev/nvidia0 Cores=0,1 Name=gpu Type=gp100 File=/dev/nvidia1 Cores=0,1 Name=gpu Type=p6000 File=/dev/nvidia2 Cores=2,3 Name=gpu Type=p6000 File=/dev/nvidia3 Cores=2,3 # Set gres/mps Count value to 100 on each of the 4 available GPUs Name=mps Count=400 And then this (sidenote, the typo of "*different*" in the example) # Example 2 of gres.conf # Configure support for four *differernt *GPU types (with MPS) AutoDetect=nvml Name=gpu Type=gtx1080 File=/dev/nvidia0 Cores=0,1 Name=gpu Type=gtx1070 File=/dev/nvidia1 Cores=0,1 Name=gpu Type=gtx1060 File=/dev/nvidia2 Cores=2,3 Name=gpu Type=gtx1050 File=/dev/nvidia3 Cores=2,3 Name=mps Count=1300 File=/dev/nvidia0 Name=mps Count=1200 File=/dev/nvidia1 Name=mps Count=1100 File=/dev/nvidia2 Name=mps Count=1000 File=/dev/nvidia3 And lower in the page, not sure what "to a job of step" means: The percentage will be calculated based upon the portion of the configured Count on the Gres is allocated to a job of step. For example, a job requesting "--gres=gpu:200" and using configuration example 2 above would be allocated 15% of the gtx1080 (File=/dev/nvidia0, 200 x 100 / 1300 = 15), or 16% of the gtx1070 (File=/dev/nvidia0, 200 x 100 / 1200 = 16), or 18% of the gtx1060 (File=/dev/nvidia0, 200 x 100 / 1100 = 18), or 20% of the gtx1050 (File=/dev/nvidia0, 200 x 100 / 1000 = 20). How were the count values of 1300, 1200, 1100 and 1000 determined? Now segueing to TensorFlow 2 and PyTorch memory greediness. Using the same "Deep Convolutional Generative Adversarial Networks <https://github.com/aymericdamien/TensorFlow-Examples/blob/master/tensorflow_v2/notebooks/3_NeuralNetworks/dcgan.ipynb>" sample script and in my sbatch file I added: #SBATCH --gres=mps:35 echo here is value of TF_FORCE_GPU_ALLOW_GROWTH $TF_FORCE_GPU_ALLOW_GROWTH echo here is the CUDA-MPS-ActiveThread-Percentage $CUDA_MPS_ACTIVE_THREAD_PERCENTAGE So the job log file showed this: here is value of TF_FORCE_GPU_ALLOW_GROWTH true here is the CUDA-MPS-ActiveThread-Percentage 17 So that 17 is half of the 35 I see with the MPS option. The description from the SchedMD page reads: "The percentage will be calculated based upon the portion of the configured Count on the Gres is allocated to a job of step." So how does Count=400 from the gres.conf file factor in? Does it mean the job is using 17% of the available threads of the GPU? From nvidia-smi on this Slurm job: +-----------------------------------------------------------------------------+ | Processes: GPU Memory | | GPU PID Type Process name Usage | |=============================================================================| | | 0 59793 C python3.6 1135MiB | The GPU has 32 GB: | 0 Tesla V100-PCIE... On | 00000000:3B:00.0 Off | 0 | | N/A 49C P0 128W / 250W | 3417MiB / 32510MiB | 96% Default | So MPS and the Count option do not help with GPU memory. So I'm trying to find ways to tell our users how to avoid the OOM's. 
The most common advice is to use smaller batches <https://stackoverflow.com/questions/37736071/tensorflow-out-of-memory>, but the complaint we get is that doing so really slows down their jobs. So I just found the "2 Physical GPUs, 2 Logical GPUs" example in the "Limiting GPU memory growth" section of the TensorFlow 2 docs <https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth>. It works by setting a hard limit, in this case 2048 MB, by adding the code below after "import tensorflow as tf":

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 2048 MB of memory on the first GPU
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=2048)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

I know this is outside the scope of Slurm, but I was hoping someone had a more graceful way to achieve this than a hard memory limit. The first option mentioned in the TF docs states:

"The first option is to turn on memory growth by calling tf.config.experimental.set_memory_growth, which attempts to allocate only as much GPU memory as needed for the runtime allocations: it starts out allocating very little memory, and as the program gets run and more GPU memory is needed, we extend the GPU memory region allocated to the TensorFlow process. Note we do not release memory, since it can lead to memory fragmentation."

I've found, using the Recurrent Neural Network Example <https://github.com/aymericdamien/TensorFlow-Examples/blob/master/tensorflow_v2/notebooks/3_NeuralNetworks/recurrent_network.ipynb>, that it jumps up to 30 GB:

I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30486 MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus id: 0000:3b:00.0, compute capability: 7.0)

But at least we have a way to deal with our users, as we have many TF and PyTorch CNN jobs.
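For reference, the memory-growth option quoted above would look roughly like this in the users' scripts; this is just a sketch following the TF guide's pattern, and I have not verified how the RNN example behaves with it on our V100 nodes:

import tensorflow as tf

# Sketch of the TF guide's first option: grow GPU allocations on demand
# instead of imposing a hard memory_limit. Must run before any GPU is
# initialized by the program.
gpus = tf.config.experimental.list_physical_devices('GPU')
try:
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)

The guide also says that setting the environment variable TF_FORCE_GPU_ALLOW_GROWTH to true has the same effect, which matches the "true" value already showing in the job log above. For the PyTorch jobs, I believe torch.cuda.set_per_process_memory_fraction() is the closest analogue to the hard limit, though I have not tried it.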