Re: [slurm-users] pam_slurm_adopt and memory constraints?

2019-07-17 Thread Christopher Samuel
On 7/17/19 4:05 AM, Andy Georges wrote: Can you show what your /etc/pam.d/sshd looks like? For us it's actually here: --- # cat /etc/pam.d/common-account #%PAM-1.0 # # This file is autogenerated by pam-config. All changes # will be o…
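For reference, a minimal account stanza wiring in pam_slurm_adopt, per SchedMD's docs (a sketch, not the poster's actual file; the docs want the module last in the account stack):

  # /etc/pam.d/sshd -- account section (sketch; keep as the last account module)
  account    required     pam_slurm_adopt.so

The module docs also suggest disabling pam_systemd, so that adopted sessions stay in the job's cgroup rather than a per-user systemd one.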

Re: [slurm-users] pam_slurm_adopt and memory constraints?

2019-07-17 Thread Sean Crosby
Hi Andy, We have RHEL7, and pam_slurm_adopt is working for us as well, with memory constraints enforced. Our pam.d/sshd: #%PAM-1.0 auth required pam_sepermit.so auth substack password-auth auth include postlogin # Used with polkit to reauthorize users in remote sessions …
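Memory constraint on adopted sessions also depends on cgroup enforcement being enabled on the node; a minimal cgroup.conf sketch, assuming TaskPlugin=task/cgroup and ProctrackType=proctrack/cgroup are set in slurm.conf:

  # /etc/slurm/cgroup.conf (sketch)
  CgroupAutomount=yes
  ConstrainCores=yes
  ConstrainRAMSpace=yes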

Re: [slurm-users] sacct issue: jobs staying in "RUNNING" state

2019-07-17 Thread Will Dennis
OK, as it turns out, it was a problem like this bug: https://bugs.schedmd.com/show_bug.cgi?id=3819 (cf. https://bugs.schedmd.com/show_bug.cgi?id=2741 as well). Back in May, I posted the following thread: https://lists.schedmd.com/pipermail/slurm-users/2019-May/003372.html - to which I never got…
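Jobs the controller has long since finished but the database still records as RUNNING are what sacctmgr calls runaway jobs; listing them is one way to confirm and clear this state (requires admin access to the accounting database):

  # lists jobs the DB thinks are still running and offers to fix them
  sacctmgr show runawayjobs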

Re: [slurm-users] sacct issue: jobs staying in "RUNNING" state

2019-07-17 Thread Will Dennis
I don't think the server (which runs both the Slurm controller daemon and the DB) is the issue... It's a Dell PowerEdge R430 platform with dual Intel Xeon E5-2640v3 CPUs, 256GB of memory, and a RAID-1 array of 1TB SATA disks. top - 09:29:26 up 101 days, 14:57, 3 users, load average: 0.06, …

Re: [slurm-users] Cluster-wide GPU Per User limit

2019-07-17 Thread David Rhey
Unfortunately, I think you're stuck setting it at the account level with sacctmgr. You could also set that limit as part of a QoS and then attach the QoS to the partition, but I think that's as granular as you can get for limiting TRES. HTH! David On Wed, Jul 17, 2019 at 10:11 AM Mike Harvey …
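A sketch of the QoS route, with an illustrative QoS name and assuming GPUs are a tracked TRES (AccountingStorageTRES includes gres/gpu):

  # cap each user at 2 GPUs across their running jobs under this QoS
  sacctmgr add qos gpuuser MaxTRESPerUser=gres/gpu=2

  # in slurm.conf, attach the QoS to every GPU partition, e.g.:
  # PartitionName=gpu Nodes=gpu[01-04] QOS=gpuuser State=UP

Since MaxTRESPerUser on a QoS counts jobs running under that QoS, attaching the same QoS to all GPU partitions should give one shared cap across them.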

Re: [slurm-users] sacct issue: jobs staying in "RUNNING" state

2019-07-17 Thread Brian W. Johanson
On 7/17/19 12:26 AM, Chris Samuel wrote: On 16/7/19 11:43 am, Will Dennis wrote: [2019-07-16T09:36:51.464] error: slurmdbd: agent queue is full (20140), discarding DBD_STEP_START:1442 request So it looks like your slurmdbd cannot keep up with the rate of these incoming steps and is having…
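The agent queue depth can be watched from the controller, and slurmdbd commits can be batched; a sketch, assuming sdiag's usual output labels:

  # the DBD agent queue size is reported in the controller stats
  sdiag | grep -i 'agent queue'

  # in slurmdbd.conf, batching commits can cut per-step overhead:
  # CommitDelay=1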

[slurm-users] Cluster-wide GPU Per User limit

2019-07-17 Thread Mike Harvey
Is it possible to set a cluster-level limit of GPUs per user? We'd like to limit how many GPUs a user may use across multiple partitions at one time. I tried this, but it obviously isn't correct: # sacctmgr modify cluster slurm_cluster set MaxTRESPerUser=gres/gpu=2 Unknown o…
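For the record, MaxTRESPerUser is a QoS option rather than a cluster option, which is why the command above fails; at the association level the nearest analogue is GrpTRES, sketched here with a hypothetical user name:

  # cap the aggregate GPUs of one user's association
  sacctmgr modify user where name=jdoe set GrpTRES=gres/gpu=2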

[slurm-users] Correct way in sbatch/srun to switch primary UNIX group.

2019-07-17 Thread Viviano, Brad
Our site has been going through the process of upgrading Slurm on our primary cluster, which was delivered to us by Bright Computing with Slurm 16.05. We're currently at 17.02.13-2 and working to get to 17.11 and then 18.08. We've run into an issue with 17.11 and switching the effective GID on a…
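As a plain-Unix angle while the version behavior is sorted out: the submitting shell's primary group can be switched with sg, assuming the user belongs to the target group (group name illustrative); whether srun/sbatch propagate it is version-dependent, which is exactly what the thread is about:

  # submit with 'projgrp' as the primary group of the submitting process
  sg projgrp -c 'sbatch job.sh'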

Re: [slurm-users] pam_slurm_adopt and memory constraints?

2019-07-17 Thread Andy Georges
Hi Mark, Chris, On Mon, Jul 15, 2019 at 01:23:20PM -0400, Mark Hahn wrote: > > Could it be a RHEL7 specific issue? > > no - centos7 systems here, and pam_adopt works. Can you show what your /etc/pam.d/sshd looks like? Kind regards, -- Andy