Re: [slurm-users] job startup timeouts?

Riebs, Andy Thu, 02 May 2019 05:47:34 -0700

This proved to be a scaling problem in PMIx; thanks to Artem Polyakov for 
tracking it down (and submitting a 
fix: <https://bugs.schedmd.com/show_bug.cgi?id=6932>).

Thanks for all the suggestions, folks!

Andy

From: Riebs, Andy
Sent: Friday, April 26, 2019 11:24 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] job startup timeouts?

Hi John,

> It's a DNS problem, isn't it?   Seriously though - how long does srun 
> hostname take for a single system?

We're running nscd on all nodes, with an extremely stable list of 
users/accounts, so I think we should be good here.

"time srun hostname" reports on the order of 0.2 seconds, so at least 
single-node requests are handled quickly!
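
One way to extend this timing check beyond a single node is to repeat it at increasing node counts and watch where the launch time stops scaling. The following is only a sketch: the node counts are arbitrary placeholders, launch_cmd is a hypothetical helper, and the script defaults to a dry run that prints the commands rather than launching anything.

```shell
#!/bin/bash
# Sketch: time "srun hostname" at growing node counts.
# Assumptions: the node counts below are illustrative only, and
# DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: build the srun command line for $1 nodes,
# one task per node, so only launch overhead is measured.
launch_cmd() {
    echo "srun --nodes=$1 --ntasks-per-node=1 hostname"
}

for n in 1 16 256 1024; do
    cmd="$(launch_cmd "$n")"
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: time $cmd"
    else
        echo "== $n nodes =="
        # Discard the hostnames; only the timing matters here.
        time $cmd > /dev/null
    fi
done
```

Setting DRY_RUN=0 inside an allocation large enough for the biggest count would collect real timings; a sharp jump between two sizes suggests where to look more closely.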

Andy
________________________________
From: John Hearns <hear...@googlemail.com>
Sent: Friday, April 26, 2019 10:56AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Cc:
Subject: Re: [slurm-users] job startup timeouts?
It's a DNS problem, isn't it?   Seriously though - how long does srun hostname 
take for a single system?

On Fri, 26 Apr 2019 at 15:49, Douglas Jacobsen <dmjacob...@lbl.gov> wrote:
We have 12,000 nodes in our system, 9,600 of which are KNL.  We can
start a parallel application within a few seconds in most cases (when
the machine is dedicated to this task), even at full scale.  So I
don't think there is anything intrinsic to Slurm that would
necessarily be limiting you, though we have seen cases in the past
where arbitrary task distribution has caused controller slow-down
issues as the detailed scheme was parsed.

Do you know if all the slurmstepd's are starting quickly on the
compute nodes?  How is the OS/Slurm/executable delivered to the node?
----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
Acting Group Lead, Computational Systems Group
National Energy Research Scientific Computing Center
dmjacob...@lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/  (_)__________________________

On Fri, Apr 26, 2019 at 7:40 AM Riebs, Andy <andy.ri...@hpe.com> wrote:
>
> Thanks for the quick response Doug!
>
> Unfortunately, I can't be specific about the cluster size, other than to say 
> it's got more than a thousand nodes.
>
> In a separate test that I had missed, even "srun hostname" took 5 minutes to 
> run. So there was no remote file system or MPI involvement.
>
> Andy
>
> -----Original Message-----
> From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com]
>  On Behalf Of Douglas Jacobsen
> Sent: Friday, April 26, 2019 9:24 AM
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] job startup timeouts?
>
> How large is very large?  Where is the executable being started?  In
> the parallel filesystem/NFS?  If that is the case you may be able to
> trim start times by using sbcast to transfer the executable (and its
> dependencies if dynamically linked) into a node-local resource, such
> as /tmp or /dev/shm depending on your local configuration.
> ----
> Doug Jacobsen, Ph.D.
> NERSC Computer Systems Engineer
> Acting Group Lead, Computational Systems Group
> National Energy Research Scientific Computing Center
> dmjacob...@lbl.gov
>
> ------------- __o
> ---------- _ '\<,_
> ----------(_)/  (_)__________________________
>
>
> On Fri, Apr 26, 2019 at 5:34 AM Andy Riebs <andy.ri...@hpe.com> wrote:
> >
> > Hi All,
> >
> > We've got a very large x86_64 cluster with lots of cores on each node, and 
> > hyper-threading enabled. We're running Slurm 18.08.7 with Open MPI 4.x on 
> > CentOS 7.6.
> >
> > We have a job that reports
> >
> > srun: error: timeout waiting for task launch, started 0 of xxxxxx tasks
> > srun: Job step 291963.0 aborted before step completely launched.
> >
> > when we try to run it at large scale. We anticipate that it could take as 
> > long as 15 minutes for the job to launch, based on our experience with 
> > smaller numbers of nodes.
> >
> > Is there a timeout setting that we're missing that can be changed to 
> > accommodate a lengthy startup time like this?
> >
> > Andy
> >
> > --
> >
> > Andy Riebs
> > andy.ri...@hpe.com
> > Hewlett-Packard Enterprise
> > High Performance Computing Software Engineering
> > +1 404 648 9024
> > My opinions are not necessarily those of HPE
> >     May the source be with you!
>
