Re: [slurm-users] Can one specify attributes on a GRES resource?

2019-03-21 Thread Will Dennis
Yes, all the nodes with those GRES types have a gres.conf with those names/counts set...

Re: [slurm-users] Can one specify attributes on a GRES resource?

2019-03-21 Thread Chris Samuel
On 21/3/19 7:39 pm, Will Dennis wrote: Why does it think that the "gres/gpu_mem_per_card" count is 0? How can I fix this? Did you remember to distribute gres.conf as well to the nodes? -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
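Distributing gres.conf to every node (and restarting slurmd afterwards) is an easy step to miss; a minimal sketch of one way to do it, assuming passwordless SSH and illustrative node names not taken from the thread:

```shell
# Hypothetical push of gres.conf to a set of GPU nodes; adjust the node
# list and paths to match the site. slurmd must re-read the file.
for node in node0{1..4}; do
    scp /etc/slurm/gres.conf "$node":/etc/slurm/gres.conf
    ssh "$node" systemctl restart slurmd
done
```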

Re: [slurm-users] Can one specify attributes on a GRES resource?

2019-03-21 Thread Will Dennis
I tried doing this as follows: Node's gres.conf: ## # Slurm's Generic Resource (GRES) configuration file ## Name=gpu File=/dev/nvidia0 Type=1050TI Name=gpu_mem_per_card C
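The snippet above is cut off mid-line; a hedged reconstruction of what such a gres.conf might look like — the Count value is an assumption, since the original is truncated:

```
## Slurm GRES configuration file (illustrative sketch, not the poster's actual file)
Name=gpu File=/dev/nvidia0 Type=1050TI
# A count-only GRES modelling per-card memory; the value 4096 is made up.
Name=gpu_mem_per_card Count=4096
```

For the count to be non-zero, the GRES name must also appear in GresTypes and in the node's Gres= line in slurm.conf, which ties into the "count is 0" question above.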

Re: [slurm-users] Slurm doesn't call mpiexec or mpirun when run through a GUI app

2019-03-21 Thread Reuti
On 21.03.2019 at 23:43, Prentice Bisbal wrote: > Slurm-users, > > My users here have developed a GUI application which serves as a GUI > interface to various physics codes they use. From this GUI, they can submit > jobs to Slurm. On Tuesday, we upgraded Slurm from 18.08.5-2 to 18.08.6-2, and

[slurm-users] Slurm doesn't call mpiexec or mpirun when run through a GUI app

2019-03-21 Thread Prentice Bisbal
Slurm-users, My users here have developed a GUI application which serves as a GUI interface to various physics codes they use. From this GUI, they can submit jobs to Slurm. On Tuesday, we upgraded Slurm from 18.08.5-2 to 18.08.6-2, and a user has reported a problem when submitting Slurm jobs t
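When a job behaves differently depending on how it was submitted, one common first check (an assumption here, not a diagnosis from the thread) is whether the submission environment — PATH in particular — reaches the batch script:

```shell
#!/bin/bash
#SBATCH --ntasks=4
# Diagnostic lines to compare the environment the job sees with the one
# an interactive shell sees; remove once the discrepancy is found.
echo "PATH=$PATH"
type mpiexec mpirun || echo "MPI launchers not found on PATH"
```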

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-21 Thread Prentice Bisbal
On 3/21/19 4:40 PM, Reuti wrote: On 21.03.2019 at 16:26, Prentice Bisbal wrote: On 3/20/19 1:58 PM, Christopher Samuel wrote: On 3/20/19 4:20 AM, Frava wrote: Hi Chris, thank you for the reply. The team that manages that cluster is not very fond of upgrading SLURM, which I understand. A

Re: [slurm-users] Database Tuning w/SLURM

2019-03-21 Thread Prentice Bisbal
On 3/21/19 1:56 PM, Ryan Novosielski wrote: On Mar 21, 2019, at 12:21 PM, Loris Bennett wrote: Our last cluster only hit around 2.5 million jobs after around 6 years, so database conversion was never an issue. For sites with a higher-throughput things may be different, but I would hope tha

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-21 Thread Prentice Bisbal
Prentice Bisbal Lead Software Engineer Princeton Plasma Physics Laboratory http://www.pppl.gov On 3/21/19 12:21 PM, Loris Bennett wrote: Hi Ryan, Ryan Novosielski writes: On Mar 21, 2019, at 11:26 AM, Prentice Bisbal wrote: On 3/20/19 1:58 PM, Christopher Samuel wrote: On 3/20/19 4:20 AM

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-21 Thread Prentice Bisbal
On 3/21/19 11:49 AM, Ryan Novosielski wrote: On Mar 21, 2019, at 11:26 AM, Prentice Bisbal wrote: On 3/20/19 1:58 PM, Christopher Samuel wrote: On 3/20/19 4:20 AM, Frava wrote: Hi Chris, thank you for the reply. The team that manages that cluster is not very fond of upgrading SLURM, which I

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-21 Thread Daniel Letai
Hi Loris, On 3/21/19 6:21 PM, Loris Bennett wrote: Chris, maybe you should look at EasyBuild (https://easybuild.readthedocs.io/en/latest/). That way you can install all the dependencies (such as zlib) as modules and be pretty much independent of

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-21 Thread Reuti
> On 21.03.2019 at 16:26, Prentice Bisbal wrote: > > > On 3/20/19 1:58 PM, Christopher Samuel wrote: >> On 3/20/19 4:20 AM, Frava wrote: >> >>> Hi Chris, thank you for the reply. >>> The team that manages that cluster is not very fond of upgrading SLURM, >>> which I understand. > > As a sy

Re: [slurm-users] Sharing a node with non-gres and gres jobs

2019-03-21 Thread Daniel Letai
Hi Peter, On 3/20/19 11:19 AM, Peter Steinbach wrote: [root@ernie /]# scontrol show node -dd g1 NodeName=g1 CoresPerSocket=4 CPUAlloc=3 CPUTot=4 CPULoad=N/A AvailableFeatures=(null) ActiveFeat

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-21 Thread Goetz, Patrick G
There are 2 kinds of system admins: can do and can't do. You're a can do; his are can't do. On 3/21/19 10:26 AM, Prentice Bisbal wrote: > > On 3/20/19 1:58 PM, Christopher Samuel wrote: >> On 3/20/19 4:20 AM, Frava wrote: >> >>> Hi Chris, thank you for the reply. >>> The team that manages that

[slurm-users] Database Tuning w/SLURM (was: Re: SLURM heterogeneous jobs, a little help needed plz)

2019-03-21 Thread Ryan Novosielski
> On Mar 21, 2019, at 12:21 PM, Loris Bennett wrote: > > Our last cluster only hit around 2.5 million jobs after > around 6 years, so database conversion was never an issue. For sites > with a higher-throughput things may be different, but I would hope that > at those places, the managers wo

Re: [slurm-users] Very large job getting starved out

2019-03-21 Thread Christopher Samuel
On 3/21/19 6:55 AM, David Baker wrote: it is currently one of the highest priority jobs in the batch partition queue What does squeue -j 359323 --start say? -- Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
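For reference, `squeue --start` reports the scheduler's expected start time for a pending job; a typical invocation against the job ID from the thread (the format string is illustrative):

```shell
# Show job ID, partition, expected start time, and pending reason.
squeue -j 359323 --start -o "%.10i %.9P %.20S %.20R"
```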

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-21 Thread Christopher Samuel
On 3/21/19 9:21 AM, Loris Bennett wrote: Chris, maybe you should look at EasyBuild (https://easybuild.readthedocs.io/en/latest/). That way you can install all the dependencies (such as zlib) as modules and be pretty much independent of the ancient packages your distro may provide (other softwar

Re: [slurm-users] practical tips to budget cluster expansion for a research center with heterogeneous workloads?

2019-03-21 Thread Alex Chekholko
Hi Noam, Right, xdmod is a standard LAMP stack webapp. You can see some pictures of the graphs in the web interface in a google image search here https://www.google.com/search?q=xdmod&source=lnms&tbm=isch&sa=X It may also require a fairly beefy database backend, depending on how many millions of

Re: [slurm-users] practical tips to budget cluster expansion for a research center with heterogeneous workloads?

2019-03-21 Thread Noam Bernstein
> On Mar 21, 2019, at 12:38 PM, Alex Chekholko wrote: > > Hey Graziano, > > To make your decision more "data-driven", you can pipe your SLURM accounting > logs into a tool like XDMOD which will make you pie charts of usage by user, > group, job, gres, etc. > > https://open.xdmod.org/8.0/inde

Re: [slurm-users] practical tips to budget cluster expansion for a research center with heterogeneous workloads?

2019-03-21 Thread Alex Chekholko
Hey Graziano, To make your decision more "data-driven", you can pipe your SLURM accounting logs into a tool like XDMOD which will make you pie charts of usage by user, group, job, gres, etc. https://open.xdmod.org/8.0/index.html You may also consider assigning this task to one of your "machine
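Before standing up XDMoD, a quick look at relative usage can come straight from sacct output. A minimal sketch, assuming `sacct -a -n -P -o User,AllocCPUS,ElapsedRaw` produced pipe-delimited lines like the made-up sample below:

```shell
# Aggregate CPU-hours per user from sacct-style records.
# The sample data is invented; in practice it would come from sacct itself.
sample='alice|8|3600
bob|4|7200
alice|16|1800'

printf '%s\n' "$sample" | LC_ALL=C awk -F'|' '
    { hours[$1] += $2 * $3 / 3600 }   # AllocCPUS * ElapsedRaw seconds -> CPU-hours
    END { for (u in hours) printf "%s %.1f\n", u, hours[u] }
' | sort
# -> alice 16.0
#    bob 8.0
```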

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-21 Thread Loris Bennett
Hi Ryan, Ryan Novosielski writes: >> On Mar 21, 2019, at 11:26 AM, Prentice Bisbal wrote: >> On 3/20/19 1:58 PM, Christopher Samuel wrote: >>> On 3/20/19 4:20 AM, Frava wrote: >>> Hi Chris, thank you for the reply. The team that manages that cluster is not very fond of upgrading SLUR

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-21 Thread Ryan Novosielski
> On Mar 21, 2019, at 11:26 AM, Prentice Bisbal wrote: > On 3/20/19 1:58 PM, Christopher Samuel wrote: >> On 3/20/19 4:20 AM, Frava wrote: >> >>> Hi Chris, thank you for the reply. >>> The team that manages that cluster is not very fond of upgrading SLURM, >>> which I understand. > > As a syste

[slurm-users] practical tips to budget cluster expansion for a research center with heterogeneous workloads?

2019-03-21 Thread Graziano D'Innocenzo
Dear Slurm users, my team is managing an HPC cluster (running Slurm) for a research centre. We are planning to expand the cluster in the next couple of years and we are facing a problem. We would like to put a figure on how many resources will be needed on average for each user (in terms of CPU cor

Re: [slurm-users] Very large job getting starved out

2019-03-21 Thread David Baker
Hi Cyrus, Thank you for the links. I've taken a good look through the first link (re the cloud cluster) and the only parameter that might be relevant is "assoc_limit_stop", but I'm not sure if that is relevant in this instance. The reason for the delay of the job in question is "priority", how
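When a pending reason is "Priority", the per-factor breakdown from sprio can show which weight dominates; a hedged example using the job ID mentioned earlier in the thread:

```shell
# Break the job's priority into its weighted factors (age, fairshare,
# job size, partition, QOS) to see which component lets other jobs outrank it.
sprio -j 359323 -l
```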

Re: [slurm-users] SLURM heterogeneous jobs, a little help needed plz

2019-03-21 Thread Prentice Bisbal
On 3/20/19 1:58 PM, Christopher Samuel wrote: On 3/20/19 4:20 AM, Frava wrote: Hi Chris, thank you for the reply. The team that manages that cluster is not very fond of upgrading SLURM, which I understand. As a system admin who manages clusters myself, I don't understand this. Our job is
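For reference, the heterogeneous-job syntax this thread concerns (available since Slurm 17.11, and current in the 18.08 series being discussed) separates components with a `packjob` directive; a minimal sketch with illustrative resource counts and program names:

```shell
#!/bin/bash
#SBATCH --ntasks=1 --mem=16G        # first component (e.g. a manager task)
#SBATCH packjob
#SBATCH --ntasks=32 --mem=2G        # second component (worker tasks)
# Launch each component against its pack group; ./manager and ./worker
# are placeholder executables.
srun --pack-group=0 ./manager : --pack-group=1 ./worker
```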

Re: [slurm-users] Very large job getting starved out

2019-03-21 Thread Cyrus Proctor
Hi David, You might have a look at the thread "Large job starvation on cloud cluster" that started on Feb 27; there are some good tidbits in there. Off the top, without more information, I would venture that settings you have in slurm.conf end up backfilling the smaller jobs at the expense of sch
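Backfill behavior is governed by SchedulerParameters in slurm.conf; the entries below are among those commonly tuned when large jobs starve, though the values are illustrative, not a recommendation:

```
# slurm.conf fragment (illustrative values only)
SchedulerType=sched/backfill
# bf_window: how far ahead (minutes) backfill plans; a window shorter than
# the large job's expected wait can keep it from ever being reserved.
SchedulerParameters=bf_window=10080,bf_resolution=600,bf_continue,bf_max_job_test=1000
```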

[slurm-users] Very large job getting starved out

2019-03-21 Thread David Baker
Hello, I understand that this is not a straightforward question, however I'm wondering if anyone has any useful ideas, please. Our cluster is busy and the QOS has limited users to a maximum of 32 compute nodes on the "batch" queue. Users are making good use of the cluster -- for example one user

Re: [slurm-users] Sharing a node with non-gres and gres jobs

2019-03-21 Thread Peter Steinbach
After more tests, the situation clears up a bit. If "Cores=0,1" (etc.) is present in the `gres.conf` file, then one can inject gres jobs on a single core only by using `--gres-flags=disable-binding` if a non-gres job is running on the same node. If "Cores=0,1" is NOT present in `gres.conf`, then any
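The behavior described hinges on whether the GRES entry pins the device to specific cores; a sketch of the two gres.conf variants (device path and core IDs are illustrative):

```
# gres.conf WITH core binding: jobs requesting this GPU are constrained
# to cores 0-1 unless submitted with --gres-flags=disable-binding.
Name=gpu File=/dev/nvidia0 Cores=0,1

# gres.conf WITHOUT core binding: no core constraint is applied, so
# gres and non-gres jobs can share the node's cores freely.
# Name=gpu File=/dev/nvidia0
```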