Our site is in the process of upgrading Slurm on our primary 
cluster, which was delivered to us with Slurm 16.05 with Bright Computing.  
We're currently at 17.02.13-2 and working to get to 17.11 and then 18.08.  
We've run into an issue with 17.11 when switching the effective GID for an 
sbatch/srun job.  I've found only one mention of this issue in the archive, 
with no specific resolution:

  https://groups.google.com/forum/#!topic/slurm-users/YZlTqBoMZ0o

Our site has a "many projects -> single user" mapping, so a given user is 
likely to be in 3+ projects, which map to corresponding Slurm accounts in 
sacctmgr.  For each Slurm account, we create a corresponding POSIX/UNIX group 
of the same name and set up a directory on our GPFS storage owned by that 
group, with a disk quota.

We made the switch to Slurm from Torque a few years back.  In Torque we were 
using the "-W group_list=" option to allow users to change their effective GID 
to one of their auxiliary groups on a per-job basis.  In 17.02 and earlier, 
we've been using the --gid= option to similar effect, allowing users to switch 
their effective GID for a given job to the auxiliary group that matches the 
project they are burning time for.

On 17.02 the following works fine:

[login1] $ id
uid=1000(user1) gid=1000(users) groups=1000(users),1001(test)

[login1] $ srun --account=general --pty /bin/bash -i
[compute1] $ id --group
1000

[login1] $ srun --account=test --gid=test --pty /bin/bash -i
[compute1] $ id --group
1001

On 17.11, using --gid gets an error:

[login1] $ srun --account=test --gid=test --pty /bin/bash -i
srun: error: --gid only permitted by root user

The only workaround I've found that mimics the same behavior is to use 
"newgrp" or "sg" on the login node to make the auxiliary group the 
effective group at submit time:

[login1] $ sg test 'srun --account=test --pty /bin/bash -i'
[compute1] $ id --group
1001
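
For illustration, the "sg" workaround could be wrapped so that users do not 
have to quote the srun command themselves.  This is only a sketch: the names 
"user_in_group" and "srun_as_group" are hypothetical, it assumes bash, that 
group names contain no whitespace, and that the Slurm account and the POSIX 
group share a name as they do at our site.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of Slurm.

# Return 0 if the invoking user belongs to group $1 (primary or auxiliary).
# Assumes group names contain no whitespace.
user_in_group() {
    local wanted=$1 g
    for g in $(id -nG); do
        [ "$g" = "$wanted" ] && return 0
    done
    return 1
}

# Usage: srun_as_group GROUP [srun options...]
# Runs srun under "sg GROUP" so that GROUP is the effective group at submit
# time, and charges the Slurm account of the same name.
srun_as_group() {
    local group=$1
    shift
    if ! user_in_group "$group"; then
        echo "srun_as_group: you are not a member of '$group'" >&2
        return 1
    fi
    # sg takes the command as a single string, so shell-quote each argument.
    local cmd
    cmd=$(printf '%q ' srun --account="$group" "$@")
    sg "$group" -c "$cmd"
}
```

With these defined, "srun_as_group test --pty /bin/bash -i" would be the 
equivalent of the sg command above.  The membership check is a convenience 
only; sg itself still decides whether the switch is allowed.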

I've reviewed the slurm-users archive, bug notes, etc., and understand the 
reason the change was made to disallow --uid/--gid except for root.  What I am 
looking for is information/suggestions on the best way to mimic the 17.02 and 
earlier functionality in a secure way.

I've already attempted to write a JobSubmit plugin that, in the "extern int 
job_submit" function, overwrites job_desc->group_id with an alternate group ID 
based on job_desc->account.  But that resulted in an error on the slurmd side:

[2019-07-16T13:26:20.073] error: job 95 credential created for gid 1001, expected 1000
[2019-07-16T13:26:20.073] error: Invalid job credential from 1000@172.20.2.2: Invalid job credential

I then attempted a SPANK plugin that adds my own option, "--egid", to 
srun/sbatch and tries to overwrite the GID that Slurm picked up from the login 
node.  I was able to bring in "--egid=test" and resolve the group name to a GID 
number, but no matter which slurm_spank_* function I tried, the change made by 
the "setegid" or "setgid" system calls didn't hold.  Once the actual slurmstepd 
process started, the effective GID in the user process was 1000 (the GID in 
effect when sbatch/srun was run) instead of 1001.
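
A small check run inside the job makes this kind of mismatch easy to spot.  
The sketch below is illustrative only: "check_job_gid" is a hypothetical name, 
and it assumes that SLURM_JOB_ACCOUNT is set in the job environment and that 
the account has a POSIX group of the same name.

```shell
#!/usr/bin/env bash
# Hypothetical in-job check: does the effective group match the expected one?
# With no argument, the expected group defaults to $SLURM_JOB_ACCOUNT, on the
# assumption that each Slurm account has a POSIX group of the same name.
check_job_gid() {
    local expected=${1:-${SLURM_JOB_ACCOUNT:-}}
    local actual
    actual=$(id -gn)    # name of the effective group
    if [ -z "$expected" ]; then
        echo "check_job_gid: no expected group given" >&2
        return 2
    fi
    if [ "$actual" != "$expected" ]; then
        echo "check_job_gid: effective group is '$actual', expected '$expected'" >&2
        return 1
    fi
    return 0
}
```

Calling "check_job_gid" at the top of a batch script, or "check_job_gid test" 
in an interactive session, returns non-zero when the job is running with the 
wrong effective group.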

I've been hoping there is something I missed, either native to Slurm or in the 
JobSubmit/SPANK plugin interfaces, that would let users switch their effective 
GID on a per-job basis to any of the groups they belong to.

Thanks,
     -Brad Viviano



===================================================
Brad Viviano

Senior Systems Engineer
