Hi there,

I was testing MPS on Slurm 19.05.5 with 4 A100s in a compute node. My expectation was that all 4 A100s would be used, but I found that only the first GPU was used, as shown below.
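As I understand the gres/mps documentation, the node's mps:400 is distributed evenly across the four GPUs (100 shares per GPU), so four jobs each requesting --gres=mps:100 should be able to run at once, one per GPU. For reference, the four jobs were submitted back to back, roughly like this (job.sh is just an illustrative name for the script below):

# submit four identical copies of the MPS job script shown below;
# each one requests 100 of the node's 400 gres/mps shares
for i in 1 2 3 4; do
    sbatch job.sh
done
squeue -u zren   # expectation: four running jobs, one per GPU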
The job script:

#!/bin/bash
#SBATCH -J date
#SBATCH -p NVIDIAA100-PCIE-40GB
#SBATCH -n 1
#SBATCH --gres=mps:100
#SBATCH --mem 1024
#SBATCH -o /home/zren/%j.out
#SBATCH -e /home/zren/%j.out

echo $CUDA_VISIBLE_DEVICES
echo $CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
./vectorAdd

Output of squeue (only one job is running):

JOBID PARTITION NAME USER ST  TIME NODES NODELIST(REASON)
  291 NVIDIAA10 date zren PD  0:00     1 (Resources)
  292 NVIDIAA10 date zren PD  0:00     1 (Priority)
  293 NVIDIAA10 date zren PD  0:00     1 (Priority)
  290 NVIDIAA10 date zren  R  0:04     1 mig4

Output of nvidia-smi (only the GPU at index 0 was used):

Tue Feb 22 09:47:45 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  On   | 00000000:18:00.0 Off |                    0 |
| N/A   33C    P0    36W / 250W |    415MiB / 40960MiB |     31%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  On   | 00000000:5E:00.0 Off |                    0 |
| N/A   30C    P0    33W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCI...  On   | 00000000:AF:00.0 Off |                    0 |
| N/A   28C    P0    32W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCI...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   30C    P0    34W / 250W |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     10228      C   ./vectorAdd                       413MiB |
+-----------------------------------------------------------------------------+
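To see what the scheduler itself thinks is allocated (as opposed to what nvidia-smi reports), the node's trackable resources can be checked while job 290 is running; a quick sketch:

# show configured vs. allocated resources on the node; with the
# config below, CfgTRES should include gres/gpu=4 and gres/mps=400
scontrol show node mig4 | grep -Ei 'gres|tres'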
The relevant configuration from slurm.conf and gres.conf:

slurm.conf:

NodeName=mig4 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=191907 MemSpecLimit=10240 Gres=gpu:4,mps:400 State=UNKNOWN

gres.conf:

AutoDetect=nvml
Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia0
Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia1
Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia2
Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia3
Name=mps Count=400

And some slurmctld.log entries for job 291, which is pending with reason (Resources):

[2022-02-22T09:47:12.890] debug3: _pick_best_nodes: JobId=291 idle_nodes 0 share_nodes 1
[2022-02-22T09:47:12.890] debug2: select/cons_tres: select_p_job_test: evaluating JobId=291
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: JobId=291 node_mode:Normal alloc_mode:Run_Now
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: node_list:mig4 exc_cores:NONE
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: nodes: min:1 max:500000 requested:1 avail:1
[2022-02-22T09:47:12.890] select/cons_tres: _job_test: evaluating JobId=291 on 1 nodes
[2022-02-22T09:47:12.890] select/cons_tres: _job_test: test 0 fail: insufficient resources
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: no job_resources info for JobId=291 rc=-1
[2022-02-22T09:47:12.890] debug2: select/cons_tres: select_p_job_test: evaluating JobId=291
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: JobId=291 node_mode:Normal alloc_mode:Test_Only
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: node_list:mig4 exc_cores:NONE
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: nodes: min:1 max:500000 requested:1 avail:1
[2022-02-22T09:47:12.890] select/cons_tres: _job_test: evaluating JobId=291 on 1 nodes
[2022-02-22T09:47:12.890] select/cons_tres: _can_job_run_on_node: 24 CPUs on mig4(state:1), mem 1024/191907
[2022-02-22T09:47:12.890] select/cons_tres: eval_nodes: set:0 consec CPUs:1 nodes:1:mig4 begin:0 end:0 required:-1 weight:511
[2022-02-22T09:47:12.890] select/cons_tres: _job_test: test 0 pass: test_only
[2022-02-22T09:47:12.890] select/cons_tres: select_p_job_test: no job_resources info for JobId=291 rc=0

Is this the expected behavior for gres/mps, or is something wrong in my configuration?

Thanks
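P.S. One variant I have not tried yet: if I read the gres.conf documentation correctly, MPS shares can also be bound to specific device files instead of being given as a single count, which would pin 100 shares to each GPU explicitly. An untested sketch of that gres.conf:

# gres.conf variant (untested): bind 100 MPS shares to each GPU
# device explicitly instead of relying on the even split of Count=400
Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia0
Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia1
Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia2
Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia3
Name=mps Count=100 File=/dev/nvidia0
Name=mps Count=100 File=/dev/nvidia1
Name=mps Count=100 File=/dev/nvidia2
Name=mps Count=100 File=/dev/nvidia3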