For our login nodes (smallish, diskless VMs) we try to limit abuse from users through a layered approach, as enumerated below.

1. User education

Users of our cluster are required to attend a training session run by our group.  In these sessions we go over what we do and don't allow on the login nodes, and we stress that we will kill long-running processes when we see them, and that repeated abuse can get a user banned for some duration of time.

2. Set the noexec mount option for any user controlled mountpoint (home, scratch, group/lab/project spaces)

This isn't a perfect solution, since noexec can be worked around by a user who understands what it means.  For example, a user couldn't run "./foo.py", but they could still run "python foo.py".  We also understand that some users have a legitimate reason to run a script on the login node; setting noexec doesn't really prevent the use of scripts, it just makes it a little harder for a user to abuse the login node.
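As a sketch, the noexec mounts might look like this in /etc/fstab on a login node (the NFS server name and export paths here are illustrative, not our actual layout):

```shell
# /etc/fstab on a login node -- mount all user-writable areas noexec
# (nfs-server and export paths are illustrative)
nfs-server:/export/home     /home     nfs  rw,nosuid,nodev,noexec  0 0
nfs-server:/export/scratch  /scratch  nfs  rw,nosuid,nodev,noexec  0 0
nfs-server:/export/projects /projects nfs  rw,nosuid,nodev,noexec  0 0
```

Adding nosuid and nodev alongside noexec is a common hardening choice on user-controlled filesystems.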

3. A small partition with shared nodes with low maxtime

For tasks that typically run longer (compression/decompression, compilation), beyond user education we also have a partition of 4 nodes that limits the number of jobs per user (2 running at a time) and enforces a MaxTime of 4 hours.  For most of our users, this covers the cases of compilation, testing, and compression/decompression.  These nodes are also set up as shared, so users must request the number of cores and the amount of memory they need, via either a batch job or an interactive job, to perform longer-running tasks.
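A rough sketch of what such a partition could look like in slurm.conf (node and partition names are illustrative; the per-user running-job cap is typically enforced with a QOS limit rather than in the partition definition itself):

```shell
# slurm.conf sketch -- small shared partition for short utility work
# (node/partition/QOS names and sizes are illustrative)
NodeName=util[01-04] CPUs=32 RealMemory=128000
PartitionName=short Nodes=util[01-04] MaxTime=04:00:00 OverSubscribe=YES QOS=short

# Create the QOS that caps each user at 2 concurrently running jobs:
#   sacctmgr add qos short
#   sacctmgr modify qos short set MaxJobsPerUser=2
```

With OverSubscribe=YES the nodes are shared, so users request cores and memory explicitly (e.g. `srun -p short -c 4 --mem=8G --pty bash`).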

4. For our software modules, we make sure to only expose the module files so the module commands work, but do not expose the path to where the compiled software resides.

This prevents users from loading a module, such as a compiler, and using it to compile code on our login nodes.  If a user can't perform the abusive action to begin with, there's no problem to deal with.  Users do sometimes ask us why the software loaded by a module does not work on the login node, at which point we re-educate them.
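One way to realize this split (a hypothetical sketch, not necessarily how it is done at the site above) is at the NFS export level: login nodes mount only the modulefile tree, while compute nodes mount both it and the software tree.

```shell
# /etc/exports on the software NFS server (host and path names illustrative)
# Login nodes see the modulefiles, so "module avail" and "module load"
# work, but the binaries those modules point at are simply absent.
/apps/modulefiles  login[01-02](ro,root_squash)  compute[001-100](ro,root_squash)
/apps/software     compute[001-100](ro,root_squash)
```

On a login node, `module load gcc` then succeeds, but the PATH entries it prepends point at directories that don't exist there.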

5. Make sure we don't install development tools (GNU compilers, JDK) on the login nodes

As we need to allow the use of scp and other transfer tools, we can't prevent the execution of everything in /bin.  Instead, we simply try to minimize the software a user could potentially use to abuse the login node.


A layered approach of education and reducing the potential ways a user can abuse our login nodes has been working for us for the past couple of years.  If we do begin to see more login-node abuse, we would probably layer on cgroups to limit memory and CPU usage.
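On a systemd-based login node, per-user cgroup caps can be applied with a drop-in for the user slice template; a minimal sketch (the values here are illustrative, not a recommendation):

```shell
# Sketch: cap every user session's CPU and memory on a login node
# via systemd cgroups (values are illustrative). Run as root.
mkdir -p /etc/systemd/system/user-.slice.d
cat > /etc/systemd/system/user-.slice.d/50-limits.conf <<'EOF'
[Slice]
# At most 2 CPUs' worth of time and 8 GiB of RAM per user
CPUQuota=200%
MemoryMax=8G
EOF
systemctl daemon-reload
```

Because the limit sits on `user-.slice`, it applies per user across all of that user's sessions, which is usually what you want on a shared login node.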


Thanks,
David

Date: Wed, 19 May 2021 19:00:38 +0300
From: Alan Orth <alan.o...@gmail.com>
To: Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>, Slurm User
Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] What is an easy way to prevent users run
programs on the master/login node.
Message-ID:
<CAKKdN4U460M0mNtS=b_8qsbbpwzkzp+bqnoqdvkih0z_b1z...@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Regarding setting limits for users on the head node. We had this for years:

# CPU time in minutes
* - cpu 30
root - cpu unlimited

But we eventually found that this was even causing long-running jobs like
rsync/scp to fail when users were copying data to the cluster. For a while
I blamed our network people, but then I did some tests and found that it
was the limits that were responsible. I have removed this and other limits
for now but I ruthlessly kill heavy processes that my users run on there. I
will look into using cgroups on the head node.

Cheers,

On Sat, Apr 24, 2021 at 11:05 AM Ole Holm Nielsen <
ole.h.niel...@fysik.dtu.dk> wrote:

On 24-04-2021 04:37, Cristóbal Navarro wrote:
Hi Community,
I have a set of users still not so familiar with Slurm, and yesterday
they bypassed srun/sbatch and just ran their CPU program directly on the
head/login node, thinking it would still run on the compute node. I am
aware that I will need to teach them some basic usage, but in the
meanwhile, how have you solved this type of user-behavior problem? Is
there a preferred way to restrict the master/login resources, or
actions, to the regular users?
We restrict user limits in /etc/security/limits.conf so users can't run
very long or very big tasks on the login nodes:

# Normal user limits
* hard cpu 20
* hard rss 50000000
* hard data 50000000
* soft stack 40000000
* hard stack 50000000
* hard nproc 250

/Ole
