See:

        https://github.com/SchedMD/slurm/blob/master/src/slurmd/slurmstepd/mgr.c


Circa line 1072 the comment explains:


                        /*
                         * Need to exec() something for proctrack/linuxproc to
                         * work, it will not keep a process named "slurmstepd"
                         */

                        execl(SLEEP_CMD, "sleep", "100000000", NULL);


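In short, proctrack/linuxproc will not keep track of a process that is named 
"slurmstepd", so it runs into an error when a slurmstepd (such as the extern 
step in your listing) has no subprocesses of its own.  A very long sleep 
command is therefore spawned as a placeholder child to satisfy that condition, 
no matter which proctrack plugin is actually being used.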
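
To make the pattern concrete, here is a minimal standalone sketch of the same 
idea: fork a child and have it exec a long sleep, so the parent always has one 
child with a different process name. This is not Slurm's code. The names 
spawn_placeholder, reap_placeholder and PLACEHOLDER_SLEEP are made up for 
illustration, and /bin/sleep is an assumed path (Slurm's build defines its own 
SLEEP_CMD).

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumed path for illustration; Slurm's build defines its own SLEEP_CMD. */
#define PLACEHOLDER_SLEEP "/bin/sleep"

/*
 * Fork a child that replaces itself with a long sleep.
 * Returns the child's PID in the parent, or -1 if fork() fails.
 */
pid_t spawn_placeholder(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: after exec the process is named "sleep", not the parent's name. */
        execl(PLACEHOLDER_SLEEP, "sleep", "100000000", (char *)NULL);
        _exit(127); /* reached only if execl() failed */
    }
    return pid;
}

/*
 * Terminate the placeholder with SIGTERM and reap it.
 * Returns 0 if the child was killed by SIGTERM, -1 otherwise.
 */
int reap_placeholder(pid_t pid)
{
    int status = 0;

    if (kill(pid, SIGTERM) != 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return (WIFSIGNALED(status) && WTERMSIG(status) == SIGTERM) ? 0 : -1;
}
```

A caller would invoke spawn_placeholder() once at startup and 
reap_placeholder() with the returned PID at teardown. While the child is alive, 
a ps listing shows it as "sleep 100000000" underneath the parent, much like the 
listing in the quoted message.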




> On Aug 3, 2018, at 17:42, Christopher Benjamin Coffey <chris.cof...@nau.edu> wrote:
> 
> Hello,
> 
> Has anyone observed "sleep 100000000" processes on their compute nodes? They 
> seem to be tied to the slurmstepd extern process in slurm:
> 
> 4 S root     136777      1  0  80   0 - 73218 do_wai 05:48 ?        00:00:01 slurmstepd: [13220317.extern]
> 0 S root     136782 136777  0  80   0 - 25229 hrtime 05:48 ?        00:00:00  \_ sleep 100000000
> 4 S root     136784      1  0  80   0 - 73280 do_wai 05:48 ?        00:00:02 slurmstepd: [13220317.batch]
> 4 S tes87    136789 136784  0  80   0 - 26520 do_wai 05:48 ?        00:00:00  \_ /bin/bash /var/spool/slurm/slurmd/job13220317/slurm_script
> 4 S root     136807      1  0  80   0 - 107157 do_wai 05:48 ?       00:00:01 slurmstepd: [13220317.1]
> 
> I'm not exactly sure what the extern piece is for. Anyone know what this is 
> all about? Is this normal? We just saw this the other day while investigating 
> some issues. Sleeping for 3.17 years seems strange. Any help would be 
> appreciated, thanks!
> 
> Best,
> Chris
> 
> —
> Christopher Coffey
> High-Performance Computing
> Northern Arizona University
> 928-523-1167
> 
> 

