Hi all, I'm facing the following issue with a DGX A100 machine: I'm able to allocate resources, but the job fails as soon as I try to execute srun. Below is a detailed analysis of the incident:
```
$ salloc -n1 -N1 -p DEBUG -w dgx001 --time=2:0:0
salloc: Granted job allocation 1278
salloc: Waiting for resource configuration
salloc: Nodes dgx001 are ready for job
$ srun hostname
srun: error: slurm_receive_msgs: [[dgx001.hpc]:6818] failed: Socket timed out on send/recv operation
srun: error: Task launch for StepId=1278.0 failed on node dgx001: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted
```
The Slurm daemon version on the DGX is:
```
$ slurmd -V
slurm 22.05.8
```
with this OS:
```
$ uname -a
Linux dgx001.hpc 5.4.0-137-generic #154-Ubuntu SMP Thu Jan 5 17:03:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.5 LTS
Release:        20.04
Codename:       focal
```
and with cgroup/v2 enabled as follows:
```
$ cat /etc/default/grub | grep cgroup
GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1 cgroup_enable=memory swapaccount=1"
```
Even though cgroup/v2 is used, the daemon status still shows a `slurmstepd` process inside `slurmd.service` (process 2250748 `slurmstepd` does not appear under the slurmd service on the other machines):
```
$ systemctl status slurmd
● slurmd.service - Slurm node daemon
     Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/slurmd.service.d
             └─override.conf
     Active: active (running) since Fri 2023-02-10 14:14:21 CET; 20min ago
   Main PID: 2250012 (slurmd)
      Tasks: 5
     Memory: 10.9M
        CPU: 105ms
     CGroup: /system.slice/slurmd.service
             ├─2250012 /usr/local/sbin/slurmd -D -s -f /var/spool/slurm/d/conf-cache/slurm.conf -vvvvvv
             └─2250748 /usr/local/sbin/slurmstepd
```
The expected job is also spawned in `slurmstepd.scope`:
```
$ systemctl status slurmstepd.scope
● slurmstepd.scope
     Loaded: loaded (/run/systemd/transient/slurmstepd.scope; transient)
  Transient: yes
     Active: active (abandoned) since Fri 2023-02-10 14:14:21 CET; 22min ago
      Tasks: 5
     Memory: 1.4M
        CPU: 28ms
     CGroup: /system.slice/slurmstepd.scope
             ├─job_1278
             │ └─step_extern
             │   ├─slurm
             │   │ └─2250609 slurmstepd: [1278.extern]
             │   └─user
             │     └─task_special
             │       └─2250619 sleep 100000000
             └─system
               └─2250024 /usr/local/sbin/slurmstepd infinity

feb 10 14:14:21 dgx001.hpc systemd[1]: Started slurmstepd.scope.
```
The slurm.conf file is the same one that works without problems on the other machines, and it has been tested.
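(For what it's worth, I believe the cgroup side can be double-checked with something like the sketch below; I'm only listing the commands here, not their output, and the grep pattern is just a guess at the relevant config keys:)
```
# should report "cgroup2fs" if the unified (v2) hierarchy is really mounted
$ stat -fc %T /sys/fs/cgroup

# the plugins that usually point at the cgroup implementation
$ scontrol show config | grep -iE 'proctrack|taskplugin|jobacctgather'
```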
Here is the output of the slurmd service:
```
$ journalctl -u slurmd
feb 10 14:14:57 dgx001.hpc slurmd[2250012]: slurmd: debug: Waiting for job 1278's prolog to complete
feb 10 14:14:57 dgx001.hpc slurmd[2250012]: slurmd: debug: Finished wait for job 1278's prolog to complete
feb 10 14:14:57 dgx001.hpc slurmd[2250012]: slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
feb 10 14:14:57 dgx001.hpc slurmd[2250012]: slurmd: debug3: slurmstepd rank 0 (dgx001), parent rank -1 (NONE), children 0, depth 0, max_depth 0
feb 10 14:14:57 dgx001.hpc slurmd[2250012]: slurmd: debug3: PLUGIN IDX
feb 10 14:14:57 dgx001.hpc slurmd[2250012]: slurmd: debug3: MPI CONF SEND
feb 10 14:14:57 dgx001.hpc slurmd[2250012]: slurmd: error: _send_slurmstepd_init failed
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug3: in the service_connection
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug2: Start processing RPC: REQUEST_TERMINATE_JOB
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug: _rpc_terminate_job: uid = 3000 JobId=1278
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug3: state for jobid 998: ctime:1675770987 revoked:0 expires:2147483647
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug3: state for jobid 1184: ctime:1675953198 revoked:0 expires:2147483647
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug3: state for jobid 1217: ctime:1675967394 revoked:0 expires:2147483647
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug3: state for jobid 1278: ctime:1676034890 revoked:0 expires:2147483647
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug: credential for job 1278 revoked
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug: sent SUCCESS, waiting for step to start
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug: Blocked waiting for JobId=1278, all steps
```
The function that fails is `_send_slurmstepd_init`, at `req.c:634`:
```
if (mpi_conf_send_stepd(fd, job->mpi_plugin_id) != SLURM_SUCCESS) {
	debug3("MPI CONF SEND");
	goto rwfail;
}
```
`mpi_conf_send_stepd` in turn fails at `slurm_mpi.c:635`:
```
if ((index = _plugin_idx(plugin_id)) < 0) {
	debug3("PLUGIN IDX");
	goto rwfail;
}
```
Configure settings:
```
./configure --prefix=/usr/local --libdir=/usr/lib64 --enable-pam --enable-really-no-cray --enable-shared --enable-x11 --disable-static --disable-salloc-background --disable-partial_attach --with-oneapi=no --with-shared-libslurm --without-rpath --with-munge --enable-developer
```
I'm sorry for the hyper-detailed mail, but I have no idea how to deal with this issue, so I hope all these details will be useful in solving it. Thanks in advance, Niccolo
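P.S. In case it helps: my (possibly wrong) reading of the code above is that `_plugin_idx` simply cannot find the job's `mpi_plugin_id` among the MPI plugins this slurmd has loaded, i.e. a mismatch between the MPI type requested for the job and the MPI plugins actually installed by this build. A rough sketch of the cross-check I have in mind follows; the plugin directory is only my guess based on the `--libdir` flag above:
```
# MPI plugin types this installation can actually load
$ srun --mpi=list

# MPI default the controller hands out to jobs
$ scontrol show config | grep -i MpiDefault

# MPI plugins shipped by this build (directory guessed from --libdir=/usr/lib64)
$ ls -l /usr/lib64/slurm/mpi_*.so
```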