Re: [slurm-users] job startup timeouts?

2019-04-26 Thread Prentice Bisbal
> We're running nscd on all nodes, with an extremely stable list of users/accounts, so I think we should be good here.
Don't bet on it. I've had issues in the past with nscd in similar situations to this. There's a reason that daemon has a "paranoid" option. Hostname should be completely local
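For reference, the safeguard Prentice alludes to is the "paranoia" setting in /etc/nscd.conf, which makes nscd restart itself periodically rather than trust its own cache indefinitely. A minimal check, with illustrative values (not a recommendation):

    # Check whether nscd's self-restart safeguard is enabled
    grep -E '^[[:space:]]*(paranoia|restart-interval)' /etc/nscd.conf
    # Example output if enabled (values are examples only):
    #   paranoia            yes
    #   restart-interval    3600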

Re: [slurm-users] job startup timeouts?

2019-04-26 Thread Christopher Samuel
On 4/26/19 7:29 AM, Riebs, Andy wrote:
> In a separate test that I had missed, even "srun hostname" took 5 minutes to run. So there was no remote file system or MPI involvement.
Worth trying: srun /bin/hostname
Just in case there's something weird in the path that causes it to hit a network
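A quick way to act on that suggestion is to see what "hostname" actually resolves to on a compute node, then time the launch with the explicit path so the shell's PATH lookup is out of the picture; a minimal sketch:

    # See which hostname binary the compute node would run
    srun -N1 which hostname
    # Then bypass PATH entirely and time the launch
    time srun -N1 /bin/hostname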

Re: [slurm-users] job startup timeouts?

2019-04-26 Thread Andy Riebs
Hi John,
> It's a DNS problem, isn't it? Seriously though - how long does srun hostname take for a single system?
We're running nscd on all nodes, with an extremely stable list of users/accounts, so I think we should be good here. "time srun hostname" reports on the order of 0.2 seconds,
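To see whether the startup cost grows with job size rather than being a per-node problem, it can help to time the same launch at increasing node counts; a rough sketch (node counts are illustrative):

    for n in 1 16 256; do
        echo "== $n nodes =="
        time srun -N "$n" --ntasks-per-node=1 hostname > /dev/null
    done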

Re: [slurm-users] job startup timeouts?

2019-04-26 Thread Andy Riebs
Thanks Doug -- your cluster is bigger than mine, and your answer ("a few seconds") is much closer to what I was expecting to see here.
> Do you know if all the slurmstepd's are starting quickly on the compute nodes?
We'll be looking into this.
> How is the OS/Slurm/executable delivered to th
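One way to look into that is to check a handful of compute nodes while the job launches, either for step-launch messages in the slurmd log or simply for the slurmstepd processes themselves. A sketch, assuming pdsh is available, the node names are node[001-004], and slurmd logs to /var/log/slurmd.log (all three are assumptions; names and paths vary by site):

    # Look for recent step launch activity in the slurmd logs
    pdsh -w node[001-004] "grep -i 'launch.*task\|slurmstepd' /var/log/slurmd.log | tail -5"
    # Or just confirm the slurmstepd processes are up
    pdsh -w node[001-004] "pgrep -la slurmstepd"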

Re: [slurm-users] job startup timeouts?

2019-04-26 Thread John Hearns
It's a DNS problem, isn't it? Seriously though - how long does srun hostname take for a single system?
On Fri, 26 Apr 2019 at 15:49, Douglas Jacobsen wrote:
> We have 12,000 nodes in our system, 9,600 of which are KNL. We can
> start a parallel application within a few seconds in most cases

Re: [slurm-users] job startup timeouts?

2019-04-26 Thread Douglas Jacobsen
We have 12,000 nodes in our system, 9,600 of which are KNL. We can start a parallel application within a few seconds in most cases (when the machine is dedicated to this task), even at full scale. So I don't think there is anything intrinsic to Slurm that would necessarily be limiting you, though

Re: [slurm-users] job startup timeouts?

2019-04-26 Thread Riebs, Andy
Thanks for the quick response Doug! Unfortunately, I can't be specific about the cluster size, other than to say it's got more than a thousand nodes. In a separate test that I had missed, even "srun hostname" took 5 minutes to run. So there was no remote file system or MPI involvement. Andy -

Re: [slurm-users] job startup timeouts?

2019-04-26 Thread Douglas Jacobsen
How large is very large? Where is the executable being started from? On a parallel filesystem/NFS? If that is the case, you may be able to trim start times by using sbcast to transfer the executable (and its dependencies, if dynamically linked) into a node-local resource, such as /tmp or /dev/shm dep
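A minimal batch-script sketch of the sbcast approach Doug describes (the binary name, destination path, and node count are placeholders):

    #!/bin/bash
    #SBATCH -N 64
    # Stage the executable into node-local storage on every allocated node
    sbcast ./my_app /tmp/my_app
    # Launch from the local copy instead of the shared filesystem
    srun /tmp/my_app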

[slurm-users] job startup timeouts?

2019-04-26 Thread Andy Riebs
Hi All, We've got a very large x86_64 cluster with lots of cores on each node, and hyper-threading enabled. We're running Slurm 18.08.7 with Open MPI 4.x on CentOS 7.6. We have a job that reports
srun: error: timeout waiting for task launch, started 0 of xx tasks
srun: Job step 291
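Two low-risk things to gather while chasing a launch timeout like this are srun's own verbose trace and the timeout- and fan-out-related settings in the running configuration; a sketch:

    # Verbose launch trace shows where the time goes (allocation, task launch, I/O setup)
    time srun -vvv -N 2 hostname
    # Review the relevant settings in the running configuration
    scontrol show config | grep -iE 'timeout|treewidth'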

[slurm-users] Node Specific Core Distribution

2019-04-26 Thread Sam Gallop (NBI)
Hi All, I'm hoping that someone may have encountered this scenario and knows of a solution. Basically, we wish to change the default core distribution, but only for specific compute nodes. The current default distribution is cyclic, but for specific nodes we would like to override this behaviour
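For what it's worth, the distribution can at least be overridden per job with the -m/--distribution option on srun or sbatch, which users on the affected nodes could apply until a node-specific default is found; a minimal sketch (the distribution values, node count, and task count are illustrative):

    # Override the cyclic default for this job only: block distribution across
    # nodes, and block within each node
    srun --distribution=block:block -N 2 -n 32 ./my_app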