Re: [slurm-users] Elastic Compute

2018-09-12 Thread Jacob Jenson
>> … run into the same circumstances you have.

Re: [slurm-users] Elastic Compute

2018-09-12 Thread Eli V

Re: [slurm-users] Elastic Compute

2018-09-11 Thread Felix Wolfheimer

Re: [slurm-users] Elastic Compute

2018-09-10 Thread Brian Haymore
> On Tuesday, 11 September 2018 12:52:27 AM AEST Brian Haymore wrote:
>> I believe the default value of this would prevent jobs from sharing a node.
> But the jobs _do_ share a node when the resources become available …

Re: [slurm-users] Elastic Compute

2018-09-10 Thread Chris Samuel
On Tuesday, 11 September 2018 12:52:27 AM AEST Brian Haymore wrote:
> I believe the default value of this would prevent jobs from sharing a node.

But the jobs _do_ share a node when the resources become available; it's just that the cloud part of Slurm is bringing up the wrong number of nodes …

Re: [slurm-users] Elastic Compute

2018-09-10 Thread Brian Haymore
I believe the default value of OverSubscribe would prevent jobs from sharing a node. You may want to look at this setting and change it from the default.

Re: [slurm-users] Elastic Compute

2018-09-10 Thread Eli V
I think you probably want CR_LLN set in your SelectTypeParameters in slurm.conf. This makes Slurm fill up a node before moving on to the next, instead of "striping" the jobs across the nodes.

On Mon, Sep 10, 2018 at 8:29 AM Felix Wolfheimer wrote:
> No, this happens without the "Oversubscribe" parameter being set. …
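[Editor's note: a minimal sketch of how that suggestion could look in slurm.conf. Only CR_LLN comes from the message above; the select plugin and the CR_Core base option are assumptions, and the exact CR_LLN semantics on your Slurm version should be checked in the slurm.conf man page.]

  SelectType=select/cons_res
  # CR_Core is an assumed base option; CR_LLN is the flag suggested above
  SelectTypeParameters=CR_Core,CR_LLN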

Re: [slurm-users] Elastic Compute

2018-09-10 Thread Felix Wolfheimer
No, this happens without the "Oversubscribe" parameter being set. I'm using custom resources, though:

  GresTypes=some_resource
  NodeName=compute-[1-100] CPUs=10 Gres=some_resource:10 State=CLOUD

Submission uses:

  sbatch --nodes=1 --ntasks-per-node=1 --gres=some_resource:1

But I just tried it without …
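[Editor's note: a GRES declared this way typically also needs a matching gres.conf entry so slurmd can register the resource on the node. The line below is only an illustrative sketch built from the names in the message above, not part of Felix's actual setup.]

  # gres.conf (hypothetical, matching the slurm.conf lines above)
  NodeName=compute-[1-100] Name=some_resource Count=10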

Re: [slurm-users] Elastic Compute

2018-09-09 Thread Brian Haymore
What do you have the OverSubscribe parameter set to on the partition you're using? -- Brian D. Haymore, University of Utah, Center for High Performance Computing, 155 South 1452 East RM 405, Salt Lake City, Ut 84112, Phone: 801-558-1150, Fax: 801-585-5366, http://bit.ly/1HO1N2C
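[Editor's note: one way to check and set this, for anyone following along. The partition name and node range are illustrative assumptions, not taken from the thread.]

  # Show the current OverSubscribe value for a partition (name is hypothetical)
  scontrol show partition cloud

  # slurm.conf: explicit setting on the partition line (the default is OverSubscribe=NO)
  PartitionName=cloud Nodes=compute-[1-100] Default=YES State=UP OverSubscribe=NO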

Re: [slurm-users] Elastic Compute on Cloud - Error Handling

2018-07-30 Thread Felix Wolfheimer
After a bit more testing I can answer my original question: I was just too impatient. When the ResumeProgram comes back with an exit code != 0, SLURM doesn't taint the node, i.e., it tries to start it again after a while. Exactly what I want! :-) @Lachlan Musicman: My slurm.conf Node and Partition …
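[Editor's note: for context, the elastic-compute behaviour discussed here is driven by the power-saving settings in slurm.conf. The paths and values below are illustrative assumptions, not Felix's actual configuration.]

  # Power-saving / elastic compute settings (illustrative values only)
  ResumeProgram=/usr/local/sbin/slurm_resume.sh     # script that creates the cloud instance
  SuspendProgram=/usr/local/sbin/slurm_suspend.sh   # script that tears it down again
  ResumeTimeout=600     # seconds Slurm waits for a resumed node to register
  SuspendTime=300       # idle seconds before a node is suspended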

Re: [slurm-users] Elastic Compute on Cloud - Error Handling

2018-07-28 Thread Lachlan Musicman
On 29 July 2018 at 04:32, Felix Wolfheimer wrote:
> I'm experimenting with SLURM Elastic Compute on a cloud platform. I'm
> facing the following situation: Let's say SLURM requests that a compute
> instance is started. The ResumeProgram tries to create the instance, but
> doesn't succeed because …
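[Editor's note: a minimal ResumeProgram sketch in the spirit of the behaviour discussed later in the thread: if instance creation fails, the script exits non-zero and Slurm retries the resume after a while. The cloud CLI call is a placeholder, not anything from the thread.]

  #!/bin/bash
  # Hypothetical ResumeProgram sketch; Slurm passes the hostlist to resume as $1.
  NODES=$(scontrol show hostnames "$1")
  for node in $NODES; do
      # Placeholder for the real provisioning call (cloud CLI, API request, ...)
      if ! my_cloud_cli create-instance --name "$node"; then
          echo "failed to create $node" >&2
          exit 1   # non-zero exit: Slurm will retry the resume later
      fi
  done
  exit 0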