Re: [slurm-users] srun: Job step aborted

2023-03-01 Niccolo Tosato
I finally solved the issue: my Slurm client on the compute node was built and configured with pmix_v3, as follows: $:/usr/local/lib/slurm$ ll | grep pmix lrwxrwxrwx 1 root root   16 feb 23 15:57 mpi_pmix.so -> ./mpi_pmix_v3.so* -rwxr-xr-x 1 root root 1003 feb 23 15:57 mpi_pmix_v3.la* -r

[slurm-users] srun: Job step aborted

2023-02-16 Niccolo Tosato
Hi all, I'm facing the following issue with a DGX A100 machine: I'm able to allocate resources, but the job fails when I try to execute srun. A detailed analysis of the incident follows: ``` $ salloc -n1 -N1 -p DEBUG -w dgx001 --time=2:0:0 salloc: Granted job allocation 1278 salloc: Waiting