Re: [slurm-users] Mysterious job terminations on Slurm 17.11.10

2019-01-31 Thread Doug Meyer
Perhaps launch it with srun -vvv to get the most verbose messages as srun works through the job. Doug On Thu, Jan 31, 2019 at 12:07 PM Andy Riebs wrote: > Hi All, > > Just checking to see if this sounds familiar to anyone. > > Environment: > - CentOS 7.5 x86_64 > - Slurm 17.11.10 (but this also happ

Re: [slurm-users] slurm, memory accounting and memory mapping

2019-01-31 Thread Sergey Koposov
Hi, Thanks again for all the suggestions. It turns out that on our cluster we can't use the cgroups because of the old kernel, but setting JobAcctGatherParams=UsePSS resolved the problems. Regards, Sergey On Fri, 2019-01-11 at 10:37 +0200, Janne Blomqvist wrote: > On 11/01/2019

Re: [slurm-users] Slurm 18.08.5 slurmctl error messages

2019-01-31 Thread Christopher Benjamin Coffey
To be more clear, the jobs aren't starting due to the group being at their limit, which is normal. But slurm is spamming that error to the log file for every job that is at a particular GrpTRESRunLimit which is not normal. Other than the log being littered with incorrect error messages, things

[slurm-users] Mysterious job terminations on Slurm 17.11.10

2019-01-31 Thread Andy Riebs
Hi All, Just checking to see if this sounds familiar to anyone. Environment: - CentOS 7.5 x86_64 - Slurm 17.11.10 (but this also happened with 17.11.5) We typically run about 100 tests/night, selected from a handful of favorites. For roughly 1 in 300 test runs, we see one of two mysterious fa

Re: [slurm-users] Slurm 18.08.5 slurmctl error messages

2019-01-31 Thread Christopher Samuel
On 1/31/19 8:12 AM, Christopher Benjamin Coffey wrote: This seems to be related to jobs that can't start due to in our case: AssocGrpMemRunMinutes, and AssocGrpCPURunMinutesLimit Must be a bug relating to GrpTRESRunLimit it seems. Do you mean can't start due to not enough time, or can't star

Re: [slurm-users] Slurm 18.08.5 slurmctl error messages

2019-01-31 Thread Christopher Benjamin Coffey
Hi All, This seems to be related to jobs that can't start due to, in our case, AssocGrpMemRunMinutes and AssocGrpCPURunMinutesLimit. Must be a bug relating to GrpTRESRunLimit, it seems. Best, Chris -- Christopher Coffey High-Performance Computing Northern Arizona University 928-523-1167 On 1

[slurm-users] Slurm 18.08.5 slurmctl error messages

2019-01-31 Thread Christopher Benjamin Coffey
Hi, we upgraded to 18.08.5 this morning and are seeing odd errors in the slurmctld logs: [2019-01-31T08:24:13.684] error: select_nodes: calling _get_req_features() for JobId=16599048 with not NULL job resources [2019-01-31T08:24:13.685] error: select_nodes: calling _get_req_features() for JobId
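To gauge how widespread errors like the ones quoted above are, one option is to tally them per job. The following is only a rough sketch: the regular expression is an assumption derived from the single log line shown in the message, the function name count_errors_by_job is made up for illustration, and the second and third sample lines are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the one error line quoted in the message above;
# the real log format may vary, so treat this as an assumption.
ERROR_RE = re.compile(
    r"error: select_nodes: calling _get_req_features\(\) "
    r"for JobId=(\d+) with not NULL job resources"
)


def count_errors_by_job(lines):
    """Return a Counter mapping JobId (as a string) to matching error lines."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    # First line is taken from the message; the other two are hypothetical.
    "[2019-01-31T08:24:13.684] error: select_nodes: calling "
    "_get_req_features() for JobId=16599048 with not NULL job resources",
    "[2019-01-31T08:24:14.000] error: select_nodes: calling "
    "_get_req_features() for JobId=16599048 with not NULL job resources",
    "[2019-01-31T08:24:15.000] some unrelated log line",
]

for job_id, n in count_errors_by_job(sample).most_common():
    print(job_id, n)
```

In practice the same function could be fed an open file handle for the slurmctld log instead of the sample list, since it only iterates over lines.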

Re: [slurm-users] service slurmctld restart

2019-01-31 Thread Paul Edmon
No. Jobs should continue as normal. -Paul Edmon- On 1/31/19 9:38 AM, Buckley, Ronan wrote: Hi, Does restarting the slurmctld daemon on a slurm head node affect running slurm jobs on the compute nodes in any way? Rgds

Re: [slurm-users] Increase MaxJobCount in slurm.conf

2019-01-31 Thread Paul Edmon
Nope, per the documentation you have to restart slurmctld to change MaxJobCount. -Paul Edmon- On 1/31/19 5:58 AM, Buckley, Ronan wrote: Hi, I want to increase the MaxJobCount in the slurm.conf file from its default value of 10,000. I want to increase it to 250,000. The online documenta

[slurm-users] service slurmctld restart

2019-01-31 Thread Buckley, Ronan
Hi, Does restarting the slurmctld daemon on a slurm head node affect running slurm jobs on the compute nodes in any way? Rgds

[slurm-users] Increase MaxJobCount in slurm.conf

2019-01-31 Thread Buckley, Ronan
Hi, I want to increase the MaxJobCount in the slurm.conf file from its default value of 10,000. I want to increase it to 250,000. The online documentation says: MaxJobCount The maximum number of jobs Slurm can have in its active database at one time. Set the values of MaxJobCount and MinJobAge