This proved to be a scaling problem in PMIx; thanks to Artem Polyakov for tracking this down (and submitting a fix <https://bugs.schedmd.com/show_bug.cgi?id=6932>).
Thanks for all the suggestions folks!

Andy

From: Riebs, Andy
Sent: Friday, April 26, 2019 11:24 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] job startup timeouts?

Hi John,

> It's a DNS problem, isn't it? Seriously though - how long does srun
> hostname take for a single system?

We're running nscd on all nodes, with an extremely stable list of users/accounts, so I think we should be good here.

"time srun hostname" reports on the order of 0.2 seconds, so at least single node requests are handled expediently!

Andy

________________________________
From: John Hearns <hear...@googlemail.com>
Sent: Friday, April 26, 2019 10:56 AM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Cc:
Subject: Re: [slurm-users] job startup timeouts?

It's a DNS problem, isn't it? Seriously though - how long does srun hostname take for a single system?

On Fri, 26 Apr 2019 at 15:49, Douglas Jacobsen <dmjacob...@lbl.gov> wrote:

We have 12,000 nodes in our system, 9,600 of which are KNL. We can start a parallel application within a few seconds in most cases (when the machine is dedicated to this task), even at full scale. So I don't think there is anything intrinsic to Slurm that would necessarily be limiting you, though we have seen cases in the past where arbitrary task distribution has caused controller slow-down issues as the detailed scheme was parsed.

Do you know if all the slurmstepd's are starting quickly on the compute nodes? How is the OS/Slurm/executable delivered to the node?

----
Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
Acting Group Lead, Computational Systems Group
National Energy Research Scientific Computing Center
dmjacob...@lbl.gov

------------- __o
---------- _ '\<,_
----------(_)/ (_)__________________________

On Fri, Apr 26, 2019 at 7:40 AM Riebs, Andy <andy.ri...@hpe.com> wrote:
>
> Thanks for the quick response Doug!
>
> Unfortunately, I can't be specific about the cluster size, other than to say
> it's got more than a thousand nodes.
>
> In a separate test that I had missed, even "srun hostname" took 5 minutes to
> run. So there was no remote file system or MPI involvement.
>
> Andy
>
> -----Original Message-----
> From: slurm-users
> [mailto:slurm-users-boun...@lists.schedmd.com]
> On Behalf Of Douglas Jacobsen
> Sent: Friday, April 26, 2019 9:24 AM
> To: Slurm User Community List
> <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] job startup timeouts?
>
> How large is very large? Where is the executable being started? In
> the parallel filesystem/NFS? If that is the case you may be able to
> trim start times by using sbcast to transfer the executable (and its
> dependencies if dynamically linked) into a node-local resource, such
> as /tmp or /dev/shm depending on your local configuration.
> ----
> Doug Jacobsen, Ph.D.
> NERSC Computer Systems Engineer
> Acting Group Lead, Computational Systems Group
> National Energy Research Scientific Computing Center
> dmjacob...@lbl.gov
>
> ------------- __o
> ---------- _ '\<,_
> ----------(_)/ (_)__________________________
>
>
> On Fri, Apr 26, 2019 at 5:34 AM Andy Riebs
> <andy.ri...@hpe.com> wrote:
> >
> > Hi All,
> >
> > We've got a very large x86_64 cluster with lots of cores on each node, and
> > hyper-threading enabled. We're running Slurm 18.08.7 with Open MPI 4.x on
> > CentOS 7.6.
> >
> > We have a job that reports
> >
> > srun: error: timeout waiting for task launch, started 0 of xxxxxx tasks
> > srun: Job step 291963.0 aborted before step completely launched.
> >
> > when we try to run it at large scale. We anticipate that it could take as
> > long as 15 minutes for the job to launch, based on our experience with
> > smaller numbers of nodes.
> >
> > Is there a timeout setting that we're missing that can be changed to
> > accommodate a lengthy startup time like this?
> >
> > Andy
> >
> > --
> >
> > Andy Riebs
> > andy.ri...@hpe.com
> > Hewlett-Packard Enterprise
> > High Performance Computing Software Engineering
> > +1 404 648 9024
> > My opinions are not necessarily those of HPE
> > May the source be with you!
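
For anyone finding this thread later: Doug's sbcast suggestion above can be sketched roughly as follows. This is a minimal illustration, not something taken from the thread. The function name stage_and_launch, the DRY_RUN switch, and the paths are all hypothetical, and it copies only the executable itself, not any shared libraries it depends on. It was also not the fix for the problem reported here, which turned out to be in PMIx.

```shell
#!/bin/bash
# Sketch only: stage a binary to node-local storage with sbcast, then launch
# the local copy with srun. All names and paths below are hypothetical.

stage_and_launch() {
    local src="$1" dst="$2"
    # With DRY_RUN set, print the commands instead of running them.
    local run=()
    [ -n "${DRY_RUN:-}" ] && run=(echo)
    # Copy the executable to the same path on every node in the allocation.
    "${run[@]}" sbcast "$src" "$dst" || return 1
    # Launch the node-local copy rather than the shared-filesystem copy.
    "${run[@]}" srun "$dst"
}

# Only act when running inside a Slurm job (SLURM_JOB_ID is set by Slurm).
if [ -n "${SLURM_JOB_ID:-}" ]; then
    stage_and_launch "$HOME/bin/my_app" "/tmp/my_app.$SLURM_JOB_ID"
fi
```

Submitted with sbatch, this would broadcast the binary once and then start the job step from /tmp. Setting DRY_RUN=1 before calling the function prints the two commands without touching the cluster, which is handy for checking the paths first.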