Re: [slurm-users] Larger jobs tend to get starved out on our cluster

2019-01-16 Thread Baker D. J.
Hi Chris, Thank you for your reply regarding OpenMPI and srun. When I try to run an MPI program using srun I find the following on red[036-037]: [red036.cluster.local:308110] PMI_Init [pmix_s1.c:168:s1_init]: PMI is not initialized [red036.cluster.local:308107] PMI_Init [pmix_s1.c:168:s1_init]:
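The error above usually points at a mismatch between the PMI support OpenMPI was built with and what srun is offering. A quick check, as a sketch only -- the pmix plugin name depends on how Slurm was built, and ./hello_mpi is a placeholder binary:

  # list the MPI plugin types this installation of srun supports
  srun --mpi=list

  # if a pmix plugin is listed, request it explicitly
  srun --mpi=pmix -N 2 -n 4 ./hello_mpi

  # or make it the default cluster-wide in slurm.conf
  MpiDefault=pmix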

Re: [slurm-users] Larger jobs tend to get starved out on our cluster

2019-01-11 Thread Baker D. J.
Hi Chris, Thank you for your comments. Yesterday I experimented with increasing the PriorityWeightJobSize and that does appear to have quite a profound effect on the job mix executing at any one time. Larger jobs (needing 5 nodes or above) are now getting a decent share of the nodes in the clu
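For reference, the job size weight is one of several multifactor weights set in slurm.conf; the values below are purely illustrative rather than the poster's actual configuration:

  PriorityType=priority/multifactor
  PriorityWeightJobSize=100000
  PriorityWeightAge=10000
  PriorityWeightFairshare=50000
  PriorityWeightQOS=10000
  PriorityFavorSmall=NO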

[slurm-users] Larger jobs tend to get starved out on our cluster

2019-01-09 Thread Baker D. J.
Hello, A colleague intimated that he thought that larger jobs were tending to get starved out on our slurm cluster. It's not a busy time at the moment so it's difficult to test this properly. Back in November it was not completely unusual for a larger job to have to wait up to a week to start.
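One quick way to see whether large jobs are being held back is to list pending jobs with their reason and priority; the format string below is just one possible layout:

  # pending jobs: id, partition, user, node count, pending reason, priority
  squeue -t PD -o "%.10i %.9P %.8u %.6D %.12r %Q"

Jobs pending with reason "Resources" are next in line; jobs pending with reason "Priority" are queued behind them.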

Re: [slurm-users] Visualisation -- Slurm and (Turbo)VNC

2019-01-04 Thread Baker D. J.
Hello, Thank you for your comments on installing and using TurboVNC. I'm working on the installation at the moment, and may get back with other questions relating to the use of Slurm with VNC. Best regards, David From: slurm-users on behalf of Daniel Letai

[slurm-users] Visualisation -- Slurm and (Turbo)VNC

2019-01-03 Thread Baker D. J.
Hello, We have set up our NICE/DCV cluster and that is proving to be very popular. There are, however, users who would benefit from using the resources offered by our nodes with multiple GPU cards. This potentially means setting up TurboVNC, for example. I would, if possible, like to be able t
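A rough sketch of how such a session might be launched under Slurm, assuming TurboVNC is installed in its default /opt/TurboVNC location and that a "gpu" partition exists (both are assumptions, not details from the original message):

  #!/bin/bash
  #SBATCH --partition=gpu
  #SBATCH --gres=gpu:1
  #SBATCH --time=04:00:00

  # start a TurboVNC server on the allocated node and report where it is running
  hostname
  /opt/TurboVNC/bin/vncserver -geometry 1920x1080

  # vncserver backgrounds itself, so keep the allocation alive for the session
  sleep infinity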

[slurm-users] PrologFlags=Contain significantly changing job activity on compute nodes

2018-12-12 Thread Baker D. J.
Hello, I wondered if someone could please help us to understand why the PrologFlags=contain flag is causing jobs to fail and draining compute nodes. We are, by the way, using slurm 18.08.0. Has anyone else seen this behaviour? I'm currently experimenting with PrologFlags=contain. I've found tha
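With PrologFlags=Contain each job gets an extra "extern" step that the node's cgroup/adopt plumbing hangs off, which is a common place for failures to surface. A couple of quick checks, with 12345 standing in for a real job id:

  # slurm.conf
  PrologFlags=Contain

  # a completed job should show an additional .extern step in accounting
  sacct -j 12345 --format=JobID,JobName,State,ExitCode

  # the reason a node was drained is recorded here
  sinfo -R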

Re: [slurm-users] Excessive use of backfill on a cluster

2018-11-21 Thread Baker D. J.
ds, David From: slurm-users on behalf of Chris Samuel Sent: 20 November 2018 20:12:20 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Excessive use of backfill on a cluster On Tuesday, 20 November 2018 11:42:49 PM AEDT Baker D. J. wrote: > We are running Slu

Re: [slurm-users] Excessive use of backfill on a cluster

2018-11-21 Thread Baker D. J.
Hi Lois, Thank you for sharing your multifactor priority configuration with us. I understand what you say about the QOS factor -- I've reduced it and increased the FS factor to see where that takes us. Our QOS factor is only there to ensure that test jobs gain a higher priority more quickly than other
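When tuning the relative weights it helps to see how much each factor actually contributes to pending jobs; both commands below are standard Slurm tools:

  # per-factor breakdown (age, fairshare, job size, QOS, ...) of pending job priorities
  sprio -l

  # the weights currently in force
  scontrol show config | grep ^Priority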

Re: [slurm-users] Excessive use of backfill on a cluster

2018-11-20 Thread Baker D. J.
Hello, Thank you for your reply and for the explanation. That makes sense -- your explanation of backfill is as we expected. I think it's more that we are surprised that almost all our jobs were being scheduled using backfill. We very rarely see any being scheduled normally. It could be that w
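The split between the two scheduling paths can be read directly from the controller's statistics; among other things, sdiag reports how many jobs backfill has started since the last restart:

  # scheduling statistics from slurmctld; see the "Backfilling stats" section
  sdiag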

[slurm-users] Excessive use of backfill on a cluster

2018-11-20 Thread Baker D. J.
Hello, We are running Slurm 18.08.0 on our cluster and I am concerned that Slurm appears to be using backfill scheduling excessively. In fact the vast majority of jobs are being scheduled using backfill. So, for example, I have just submitted a set of three serial jobs. They all started on a c
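How often jobs start via the main scheduling loop rather than via backfill is largely governed by the scheduler settings in slurm.conf; the parameters below are the usual knobs, and the values are only illustrative:

  SchedulerType=sched/backfill
  SchedulerParameters=default_queue_depth=1000,bf_interval=30,bf_max_job_test=1000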

Re: [slurm-users] Seff error with Slurm-18.08.1

2018-11-06 Thread Baker D. J.
Hello Mike et al, This is a known bug in slurm v18.08*. We installed the initial release a short while ago and came across this issue very quickly. We actually use this script at the end of the job epilog to report job efficiency to users, and so it is a real shame that it is now broken! The goo

Re: [slurm-users] Help with developing a lua job submit script

2018-10-10 Thread Baker D. J.
Hello, Thank you for your useful replies. It's certainly not anywhere near as difficult as I initially thought. We should be able to start some tests later this week. Best regards, David From: slurm-users on behalf of Roche Ewan Sent: 10 October 2018 08:07 To: S

[slurm-users] Help with developing a lua job submit script

2018-10-09 Thread Baker D. J.
Hello, We are starting to think about developing a lua job submission script. For example, we are keen to route jobs requiring no more than 1 compute node (single core jobs and small parallel jobs) to a slurm shared partition. The idea being that "small" jobs can share a small set of compute n
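The lua side of this is a job_submit.lua script (placed alongside slurm.conf) that inspects fields such as job_desc.min_nodes and rewrites job_desc.partition. The configuration needed to enable it and to define a shareable partition might look roughly like the lines below, with the node names purely made up:

  # slurm.conf
  JobSubmitPlugins=lua

  # a partition that single-node jobs can be routed to, allowing jobs to share nodes
  PartitionName=shared Nodes=red[001-010] MaxNodes=1 OverSubscribe=YES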

[slurm-users] Node/job failures following scontrol reconfigure command

2018-10-04 Thread Baker D. J.
Hello, We have just finished an upgrade to slurm 18.08. My last task was to reset the slurmctld/slurmd timeouts to sensible values -- as they were set prior to the update. That is: SlurmctldTimeout=60 sec, SlurmdTimeout=300 sec. With slurm <18.08 I've reconfigured the clu
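For reference, the usual way to apply a timeout change like this is to edit slurm.conf and push it out with scontrol; the grep afterwards simply confirms what the daemons picked up:

  # slurm.conf
  SlurmctldTimeout=60
  SlurmdTimeout=300

  scontrol reconfigure
  scontrol show config | grep -i timeout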

Re: [slurm-users] Upgrading a slurm on a cluster, 17.02 --> 18.08

2018-09-26 Thread Baker D. J.
om: slurm-users on behalf of Chris Samuel Sent: 26 September 2018 11:26 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Upgrading a slurm on a cluster, 17.02 --> 18.08 On Tuesday, 25 September 2018 11:54:31 PM AEST Baker D. J. wrote: > That will certainly work, however the slur

Re: [slurm-users] Upgrading a slurm on a cluster, 17.02 --> 18.08

2018-09-25 Thread Baker D. J.
David From: slurm-users on behalf of Chris Samuel Sent: 25 September 2018 13:00 To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] Upgrading a slurm on a cluster, 17.02 --> 18.08 On Tuesday, 25 September 2018 9:41:10 PM AEST Baker D. J. wrote: > I guess that the only sol

[slurm-users] Upgrading a slurm on a cluster, 17.02 --> 18.08

2018-09-25 Thread Baker D. J.
Hello, I wondered if I could compare notes with other community members who have upgraded slurm on their cluster. We are currently running slurm v17.02 and I understand that the rpm mix/structure changed at v17.11. We are, by the way, planning to upgrade to v18.08. I gather that I should upg
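The order SchedMD documents for a major upgrade is: back up the accounting database, upgrade slurmdbd first, then slurmctld, then the slurmds on the compute nodes. A rough sketch of the first stage only, with package names and database credentials assumed rather than taken from the original message:

  # on the database host
  systemctl stop slurmdbd
  mysqldump slurm_acct_db > slurm_acct_db.sql
  yum upgrade slurm-slurmdbd          # or install the rebuilt 18.08 RPMs
  systemctl start slurmdbd

  # then repeat the stop/upgrade/start pattern for slurmctld, and finally slurmd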

[slurm-users] Advice on managing GPU cards using SLURM

2018-03-05 Thread Baker D. J.
Hello, I'm sure that this question has been asked before. We have recently added some GPU nodes to our SLURM cluster. There are 10 nodes each providing 2 * Tesla V100-PCIE-16GB cards, and there are 10 nodes each providing 4 * GeForce GTX 1080 Ti cards. I'm aware that the simplest way to manage these
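Cards like these are normally managed through the gres/gpu plugin; a minimal sketch, with node names, counts and device paths invented for illustration:

  # slurm.conf
  GresTypes=gpu
  NodeName=gpu[001-010] Gres=gpu:v100:2
  NodeName=gtx[001-010] Gres=gpu:gtx1080ti:4

  # gres.conf on one of the V100 nodes
  Name=gpu Type=v100 File=/dev/nvidia[0-1]

  # jobs then request a particular card type
  sbatch --gres=gpu:v100:1 job.sh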