Re: [slurm-users] Determine usage for a QOS?
A while ago, I thought a patch was made to sshare to show raw TRES usage, something like:

sshare -o account,user,GrpTRESRaw

At the time I used this I was only concerned with account usage, so I didn't check whether sshare would work at the QOS level. I'm not sure that "feature" was in the man page the last time I looked.

- Gary Skouson

-----Original Message-----
From: slurm-users On Behalf Of Christopher Samuel
Sent: Sunday, August 19, 2018 9:39 PM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Determine usage for a QOS?

Hi Paul,

On 20/08/18 11:36, Paul Edmon wrote:
> I don't really have enough experience with QoS's to give a slicker
> method but you could use squeue --qos to poll the QoS and then write a
> wrapper to do the summarization. It's hacky but it should work.

I was thinking of sacct -q ${QOS} to pull info out of the DB, but since Slurm keeps this information locally to decide whether new jobs can use the QOS, I wondered if there was a less heavy-handed way to get at it. I might dig into the code before opening a support request.

cheers!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
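For what it's worth, a rough wrapper along the lines Paul suggests might look like this (a sketch only, not from the thread; the QOS name, time window and format fields are placeholders to adjust):

# Sum the CPUs currently allocated to running jobs in a given QOS:
squeue --qos=normal --states=RUNNING --noheader -o "%C" | awk '{cpus += $1} END {print "CPUs in use:", cpus}'

# Or pull the accounted usage for a time window out of the database:
sacct -q normal -X -S 2018-08-01 -E now -o JobID,User,AllocTRES,Elapsed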
[slurm-users] pam_slurm_adopt does not constrain memory?
Hi,

We observed some strange behavior of pam_slurm_adopt regarding the cgroups involved:

When we start a shell as a new Slurm job using "srun", the process has its freezer, cpuset and memory cgroups set up as e.g. "/slurm/uid_5001/job_410318/step_0". That's good!

However, another shell started via an SSH login is handled by pam_slurm_adopt. That process is only placed into the freezer and cpuset cgroups, set up as "/slurm/uid_5001/job_410318/step_extern"; it lacks the configuration of the "memory" cgroup (see output below). As a consequence, all tools started from this shell prompt are not subject to any memory restrictions. That's bad for our use case, as we need to partition the memory of our SMP machines for several independent jobs/users.

Is this the expected behavior of pam_slurm_adopt/slurmstepd? Or maybe a configuration issue? Did I miss something? A bug?

To me it looks similar to this old issue: https://bugs.schedmd.com/show_bug.cgi?id=2236

We're currently running Slurm 17.11.8 (we had already seen this with our previous version, 17.11.5).

Thanks for your help and suggestions!

christian

--

== cgroups within srun ==

login$ srun --pty bash
node064$ cat /proc/self/cgroup
11:pids:/system.slice/slurmd.service
10:freezer:/slurm/uid_501/job_410318/step_0
9:cpuset:/slurm/uid_501/job_410318/step_0
8:cpuacct,cpu:/system.slice/slurmd.service
7:net_prio,net_cls:/
6:blkio:/system.slice/slurmd.service
5:perf_event:/
4:devices:/system.slice/slurmd.service
3:memory:/slurm/uid_501/job_410318/step_0
2:hugetlb:/
1:name=systemd:/system.slice/slurmd.service

== cgroups for the external step ==

login$ ssh node064
node064$ cat /proc/self/cgroup
11:pids:/user.slice
10:freezer:/slurm/uid_501/job_410318/step_extern
9:cpuset:/slurm/uid_501/job_410318/step_extern
8:cpuacct,cpu:/user.slice
7:net_prio,net_cls:/
6:blkio:/user.slice
5:perf_event:/
4:devices:/user.slice
3:memory:/user.slice
2:hugetlb:/
1:name=systemd:/user.slice/user-501.slice/session-430.scope
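For context, the memory containment in a setup like this usually hinges on a handful of settings; a minimal sketch with illustrative values, not Christian's actual configuration:

# cgroup.conf
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

# slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
PrologFlags=contain        # creates the "extern" step that pam_slurm_adopt adopts SSH sessions into

# /etc/pam.d/sshd (or the distribution's shared include file)
account    required     pam_slurm_adopt.so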
Re: [slurm-users] pam_slurm_adopt does not constrain memory?
Hi Christian,

On Wed, Aug 22, 2018 at 7:27 AM, Christian Peter wrote:
> we observed a strange behavior of pam_slurm_adopt regarding the involved
> cgroups:
>
> when we start a shell as a new Slurm job using "srun", the process has
> freezer, cpuset and memory cgroups setup as e.g.
> "/slurm/uid_5001/job_410318/step_0". that's good!
>
> however, another shell started by an SSH login is handled by
> pam_slurm_adopt. that process is only affected by the freezer and cpuset
> cgroups setup as "/slurm/uid_5001/job_410318/step_extern". it lacks the
> configuration of the "memory" cgroup. (see output below)

My guess is that you're experiencing first-hand the awesomeness of systemd. The SSH session likely inherits the default user.slice systemd cgroups, which take over and override the ones set by Slurm. So, instead of inheriting the job's limits via the pam_slurm_adopt module, your SSH shell gets the default systemd cgroup settings, which are useless in your context.

I usually get rid of any reference to pam_systemd.so in /etc/pam.d/, and sometimes push it a bit further and delete /lib*/security/pam_systemd.so, which inevitably brings a blissful grin to my face.

You may want to take a look at the following bugs:
* https://bugs.schedmd.com/show_bug.cgi?id=3912
* https://bugs.schedmd.com/show_bug.cgi?id=3674
* https://bugs.schedmd.com/show_bug.cgi?id=3158
but they all boil down to the same conclusion.

Cheers,
--
Kilian
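In practice that means something like the following (a sketch; PAM file names vary by distribution, so back up and test on a single node first):

# find where pam_systemd is referenced
grep -rn pam_systemd.so /etc/pam.d/

# then comment out the matching "session ... pam_systemd.so" lines, e.g. on a RHEL-like system:
sed -i.bak 's/^\(-\?session[[:space:]].*pam_systemd\.so\)/#\1/' /etc/pam.d/password-auth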
Re: [slurm-users] pam_slurm_adopt does not constrain memory?
On 08/22/2018 10:58 AM, Kilian Cavalotti wrote:
> My guess is that you're experiencing first-hand the awesomeness of systemd.

Yes, systemd uses cgroups. I'm trying to understand whether Slurm's use of cgroups is incompatible with systemd, or if there is another way to resolve this issue?

Looking at the man page for pam_systemd, it looks reasonably safe to disable it for logins on HPC compute nodes. I think you will also need to mask the systemd-logind service if you remove the PAM module:

systemctl stop systemd-logind
systemctl mask systemd-logind
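Once pam_systemd is out of the PAM stack and logind is masked, a quick sanity check, reusing the node and job from Christian's output purely as an illustration:

ssh node064 'grep memory /proc/self/cgroup'
# should now show the job's cgroup, e.g. 3:memory:/slurm/uid_501/job_410318/step_extern
# instead of 3:memory:/user.slice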
[slurm-users] "Partly Cloudy" conference Oct'18 in Seattle
All,

Registration is open for the 2018 Partly Cloudy conference: http://partly-cloudy.fredhutch.org

While this conference is not Slurm specific, many smaller Slurm shops have already confirmed their attendance at the "Partly Cloudy" conference in Seattle, so this shameless plug feels relevant to some of you.

Shops that provide compute infrastructures (HPC, Big Data, containers) continue to be challenged by questions such as "what to cloud", "if to cloud", "when to cloud". Some of us want to use the cloud at certain times (bursting), and others bring up certain services in the cloud and combine them with on-prem resources in a cost-efficient way, and... surprise... things do not always work as advertised. Over the last year we at Fred Hutch have made some significant investments in getting integrated Slurm clusters running in AWS and Google Cloud:

AIRI conference: https://www.slideshare.net/dirkpetersen/scientific-computing-fred-hutch
Google Next: https://www.youtube.com/watch?v=Fw_CR2WgPRY&t=5s

While I am still working with all 3 cloud providers in Seattle on recruiting speakers with a focus on hybrid cloud, I am forwarding the announcement my friend Stuart Kendrick (who did much of the real work) already sent to his large network. Apologies for any duplicates, and perhaps I will see you in Seattle this fall.

Thanks much
Dirk

##

Hi folks,

Registration is open for the 2018 Partly Cloudy conference: http://partly-cloudy.fredhutch.org

This conference is aimed at IT infrastructure folks who want to talk with peers about how to flow between on-prem systems and cloud systems, in support of scientific research. Dirk Petersen (Fred Hutch) is driving this, with support from myself (Allen Institute / coordination), Bob Robbins (consultant), and Donna Obuchowski (Fred Hutch / logistics).

Format:
10/24 Dinner          Informal discussion, get to know each other
10/25 Morning         The speakers talk about their experiences, Q&A
10/25 Afternoon       Split into working groups to talk informally about current challenges
10/25 End of the day  Bring everyone together to summarize lessons learned

Mostly, we expect a regional audience, and we emphasize attending in person. That being said, this is a prototype run for what we imagine will be a nationally relevant conference, which we intend to host in 2019. To that end, we will offer, experimentally, virtual attendance - streamed plus the ability to 'raise your hand and ask a question'.

Dirk Petersen
Scientific Computing Director
Fred Hutch
1100 Fairview Ave. N.
Mail Stop M4-A882
Seattle, WA 98109
Phone: 206.667.5926
Skype: internetchen
Re: [slurm-users] Job cannot start on slurm v18.08.0pre2
Hi,

My test script is like this:

=========================================
#!/bin/bash
#SBATCH -J LOOP
#SBATCH -p low
#SBATCH --comment test
#SBATCH -N 1
#SBATCH -n 5
#SBATCH -o log/%j.loop
#SBATCH -e log/%j.loop

date
echo "SLURM_JOB_NODELIST=${SLURM_JOB_NODELIST}"
echo "SLURM_NODELIST=${SLURM_NODELIST}"
sleep 2100
echo "step 3 over"
date
=========================================

If I get rid of srun and run sleep directly, the behavior is the same. In addition, I did not enable the MpiDefault and MpiParams parameters in the configuration file slurm.conf. So, what is the possible reason for this problem?

zhangtao102...@126.com

From: Artem Polyakov
Date: 2018-08-22 06:02
To: Slurm User Community List
CC: slurm-users
Subject: Re: [slurm-users] Job cannot start on slurm v18.08.0pre2

Hello,

I can try to answer from the PMIx/UCX perspective. Do you have the "MpiDefault=pmix" parameter in your slurm.conf, or have you specified "--mpi=pmix" in your srun command? If not, you are not running PMIx and thus not UCX (UCX support is only in the PMIx plugin). I think this is confirmed by the log output that you have provided; I don't see any traces of the PMIx plugin.

On Fri, Aug 17, 2018 at 20:43, zhangtao102...@126.com wrote:

> Hi,
>
> I have installed Slurm 18.08.0-0pre2 on my cluster based on RHEL 7.4 (x86_64). My configure parameters look like this:
>
> ./configure --prefix=/opt/slurm17 --with-munge=/opt/munge --with-pmix=/opt/pmix --with-ucx=/opt/openucx --with-hwloc=/usr
>
> (openucx version is 1.5.0, pmix version is 3.0.0, hwloc version is 1.11.8)
>
> After completing the installation and configuration, it looks like Slurm is working normally. But when I submitted a simple test job with "sbatch sleep.sh" (which just calls "srun sleep 30" on a single compute node), I found that the job (ID=1032) state was R, yet the job did not actually start on the compute node (no process found). The appendix contains the output logs of the compute node and the management node. I can't tell whether the cause of this problem is related to the compile options I specified (such as pmix and ucx); I've never seen anything similar in earlier versions. Has anyone seen a similar phenomenon? How can the problem be solved?
>
> Best regards
> zhangtao102...@126.com

--
Best regards, Artem Y. Polyakov
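To make Artem's suggestion concrete, a short sketch of how to check and enable PMIx, assuming the pmix plugin was built into this Slurm installation (the application name below is a placeholder):

# list the MPI plugin types this installation knows about; "pmix" should appear
srun --mpi=list

# launch a step with PMIx explicitly
srun --mpi=pmix -N 1 -n 5 ./my_mpi_app

# or make it the cluster-wide default in slurm.conf:
# MpiDefault=pmix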