We see this here as well: there is a difference in behavior depending on
whether the program runs out of the "standard" NFS filesystem or the GPFS
filesystem. If the I/O goes through NFS, there are conditions where we see
this with some frequency on a given problem. It does not happen every time,
but it can be reproduced.
I think cgroups is probably more elegant, but here is another script:
https://github.com/FredHutch/IT/blob/master/py/loadwatcher.py#L59
The email text is hard-coded, so please change it before using. We put this in
place in Oct 2017 when things were getting out of control because folks were
overloading the login nodes.
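In case a sketch helps: the following is not the linked loadwatcher.py, just a
minimal illustration of the same idea (per-user CPU accounting via the psutil
library plus a nagging email). The threshold, mail domain and message text are
placeholders.

#!/usr/bin/env python3
"""Minimal login-node watcher sketch (not the FredHutch loadwatcher.py).

Aggregates CPU usage per user with psutil and emails anyone above a
threshold. The hostname, threshold and wording are placeholders; adapt
before use.
"""
import smtplib
import time
from collections import defaultdict
from email.message import EmailMessage

import psutil

CPU_THRESHOLD = 200.0          # percent, i.e. two full cores
MAIL_DOMAIN = "example.org"    # placeholder domain for user emails
SMTP_HOST = "localhost"

def cpu_by_user(sample_seconds=5):
    """Return {username: summed %CPU} measured over a short interval."""
    procs = list(psutil.process_iter(["username"]))
    for p in procs:
        try:
            p.cpu_percent(None)           # prime the per-process counters
        except psutil.Error:
            pass
    time.sleep(sample_seconds)
    usage = defaultdict(float)
    for p in procs:
        try:
            user = p.info["username"]
            if user is not None:
                usage[user] += p.cpu_percent(None)
        except psutil.Error:
            continue                      # process went away mid-sample
    return usage

def notify(user, pct):
    msg = EmailMessage()
    msg["Subject"] = f"High CPU usage on the login node ({pct:.0f}%)"
    msg["From"] = f"root@{MAIL_DOMAIN}"
    msg["To"] = f"{user}@{MAIL_DOMAIN}"
    msg.set_content(
        "Please move heavy work into Slurm jobs; the login node is shared."
    )
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    for user, pct in cpu_by_user().items():
        if user != "root" and pct > CPU_THRESHOLD:
            notify(user, pct)

A cron entry every few minutes is enough to drive something like this.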
I had previously contacted Ryan Cox about his solution and worked with
it a little to implement it on our CentOS 7 cluster. While I liked his
solution, I felt it was a little complex for our needs.
I'm a big fan of keeping stuff really simple, so I came up with two simple
shell scripts to solve the problem.
Manuel,
We set up cgroups and also do CPU time limits (60 minutes in our case) in
limits.conf. Before libcgroup had support for a more generic "apply to
each user" kind of thing, I created a PAM module that handles all of
that, and it still works well for creating per-user limits. We also have
s
On Thursday, 15 February 2018, at 16:11:29 (+0100),
Manuel Rodríguez Pascual wrote:
> Although this is not strictly related to Slurm, maybe you can recommend
> some actions to deal with a particular user.
>
> On our small cluster, currently there are no limits on running
> applications on the frontend.
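For reference, the CPU time limit described in the reply above lives in
/etc/security/limits.conf; a minimal sketch using the 60-minute value from
that post (per-user or per-group exceptions can be added with further lines):

# /etc/security/limits.conf (fragment)
# "cpu" is the maximum CPU time per process, in minutes
*        hard    cpu     60

The hard limit delivers SIGKILL when it is hit; adding a slightly lower soft
limit gives processes a SIGXCPU warning first.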
Hi Manuel,
Manuel Rodríguez Pascual writes:
> Hi all,
>
> Although this is not strictly related to Slurm, maybe you can
> recommend some actions to deal with a particular user.
>
> On our small cluster, currently there are no limits on running
> applications on the frontend. This is sometimes really useful for some
> users, for example to have scripts monitoring
I've used this with some success:
https://github.com/JohannesBuchner/verynice. For CPU-intensive things it
works great, but you also have to set some memory limits in limits.conf if
users do any large-memory stuff. Otherwise I just use a problem process as
a chance to start a conversation with that user.
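On the memory side mentioned above, the address-space limit is the usual
limits.conf knob; a sketch, with the 8 GB value chosen here only as an
example ("as" is specified in KB):

# /etc/security/limits.conf (fragment)
# cap each process at roughly 8 GB of address space
*        hard    as      8388608

Note that this is a per-process limit, so a user can still exhaust the node
with many smaller processes, which is where the cgroup approaches discussed
in this thread come in.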
Hi Manuel,
A possible workaround is to configure a per-user cgroup limit on the
frontend node so that a single user cannot allocate more than 1 GB of RAM (or
whatever value you prefer). The user would still be able to abuse the
machine, but as soon as his memory usage goes above the limit his job will
be killed.
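One way to get such a per-user cap on a systemd-based frontend (CentOS 7 and
later) is a drop-in on the user's slice; a sketch, where the UID is a
placeholder and the 1 GB figure mirrors the example above:

# /etc/systemd/system/user-1000.slice.d/50-memory.conf  (1000 = example UID)
[Slice]
MemoryAccounting=yes
MemoryLimit=1G

After a systemctl daemon-reload (and a fresh login for that user), processes
in the slice that push past the limit are terminated by the kernel's OOM
killer, which is the behaviour described above. On cgroup v2 systems the
directive is MemoryMax= rather than MemoryLimit=.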
Every cluster I've ever managed has this issue. Once cgroup support arrived in
Linux, the path we took (on CentOS 6) was to use the 'cgconfig' and 'cgred'
services on the login node(s) to set up containers for regular users and
quarantine them therein. The config left 4 CPU cores unused by regular users.
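A minimal sketch of what such a cgconfig/cgred setup can look like; the core
range, memory cap and group names below are invented for illustration and are
not the poster's actual config:

# /etc/cgconfig.conf: confine ordinary users to cores 4-15 and 48 GB of RAM
group users {
    cpuset {
        cpuset.cpus = "4-15";
        cpuset.mems = "0";
    }
    memory {
        memory.limit_in_bytes = "48G";
    }
}

# /etc/cgrules.conf: cgred moves matching processes into that group;
# the first matching rule wins, so admins are listed before the catch-all
root      *                /
@wheel    *                /
*         cpuset,memory    users

With the cgconfig and cgred services enabled at boot, every regular user's
processes land in the "users" group, leaving the reserved cores (0-3 in this
example) for system tasks and administrators.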
We kick them off and lock them out until they respond. Disconnections are
common enough that being kicked off doesn't always get their attention;
the inability to log back in always does.
Best,
Bill.
> On Feb 15, 2018, at 9:25 AM, Patrick Goetz wrote:
>
> The simple solution is to tell people not to do this -- that's what I do.
We have an automated script, pcull, which goes through and finds abusive
processes:
https://github.com/fasrc/pcull
-Paul Edmon-
On 02/15/2018 10:25 AM, Patrick Goetz wrote:
The simple solution is to tell people not to do this -- that's what I
do. And if that doesn't work, threaten to kick them off the system.
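Not the fasrc/pcull code, but a toy version of the same idea, for anyone who
wants to see the shape of it; the threshold, whitelist and dry-run default
are arbitrary examples:

#!/usr/bin/env python3
"""A toy cull script in the spirit of pcull (not the fasrc/pcull code).

Terminates user processes whose resident memory exceeds a threshold,
skipping a small whitelist. Adjust the values before use.
"""
import psutil

RSS_LIMIT = 8 * 1024**3                  # 8 GB resident, example value
WHITELIST = {"sshd", "bash", "tmux", "screen"}
DRY_RUN = True                           # set to False to actually terminate

for proc in psutil.process_iter(["name", "username", "memory_info"]):
    try:
        name = proc.info["name"]
        user = proc.info["username"]
        mem = proc.info["memory_info"]
        if user in (None, "root") or name in WHITELIST or mem is None:
            continue
        if mem.rss > RSS_LIMIT:
            print("pid=%d user=%s cmd=%s rss=%d"
                  % (proc.pid, user, name, mem.rss))
            if not DRY_RUN:
                proc.terminate()         # SIGTERM first; escalate by hand
    except psutil.Error:
        continue                         # the process exited mid-scan

Running it from cron on the login nodes in dry-run mode first is an easy way
to tune the whitelist before letting it kill anything.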
The simple solution is to tell people not to do this -- that's what I
do. And if that doesn't work, threaten to kick them off the system.
On 02/15/2018 09:11 AM, Manuel Rodríguez Pascual wrote:
Hi all,
Although this is not strictly related to Slurm, maybe you can recommend
some actions to deal with a particular user.
Hi all,
Although this is not strictly related to Slurm, maybe you can recommend
some actions to deal with a particular user.
On our small cluster, currently there are no limits on running applications
on the frontend. This is sometimes really useful for some users, for example
to have scripts monitoring