On 2/9/21 5:08 pm, Paul Edmon wrote:

1. Being on the latest release: A lot of work has gone into improving RPC throughput; if you aren't running the latest 20.11 release I highly recommend upgrading.  20.02 was also pretty good at this.

We've not gone to 20.11 on production systems yet, but I can vouch for 20.02 being far better than previous versions for scheduling performance.

We also use the cli_filter Lua plugin to implement our own RPC limiting mechanism, using a local directory of per-user files. The big advantage of this is that the rate limiting happens client side, so the excess requests never get sent to slurmctld in the first place. Yes, it is theoretically possible for users to discover and work around this, but the intent here is to catch accidental/naive use rather than anything malicious.
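
To illustrate the idea only (this is not our actual plugin, which is written in Lua for cli_filter), here is a minimal Python sketch of a sliding-window limiter backed by one file per user. The function name `allow_request`, the `.stamps` file suffix, and the limit values are all hypothetical, and the sketch does no file locking, so concurrent commands from the same user could race.

```python
import os
import time


def allow_request(user, state_dir, max_calls=5, window=10.0, now=None):
    """Return True if `user` may issue another command, else False.

    Hypothetical helper: allows at most `max_calls` commands per
    `window` seconds, tracked in a per-user file under `state_dir`.
    """
    now = time.time() if now is None else now
    path = os.path.join(state_dir, f"{user}.stamps")

    # Load the timestamps of earlier calls; a missing file means no history.
    try:
        with open(path) as fh:
            stamps = [float(line) for line in fh if line.strip()]
    except FileNotFoundError:
        stamps = []

    # Keep only the calls that still fall inside the window.
    recent = [t for t in stamps if now - t < window]

    allowed = len(recent) < max_calls
    if allowed:
        recent.append(now)

    # Rewrite the file so it never holds more than one window of entries.
    with open(path, "w") as fh:
        fh.writelines(f"{t!r}\n" for t in recent)
    return allowed
```

A wrapper would call this before contacting the controller and refuse (or sleep) when it returns False, so a tight loop of status commands is throttled on the login node rather than at slurmctld.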

Getting users to use `sacct` rather than `squeue` to check what state a job is in can also help a lot, since `sacct` reads from the accounting records rather than querying slurmctld, which reduces the load on slurmctld.
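
For users who poll job state from scripts, something along these lines may be a starting point. It is a sketch that assumes accounting is enabled on the cluster; the helper names are made up for illustration, and only the `sacct` options shown (`-X`, `-n`, `-P`, `-j`, `--format`) come from Slurm itself.

```python
import subprocess


def sacct_state_command(job_id):
    """Build the sacct command line for a single job's state."""
    # -X: report the allocation only, not individual steps
    # -n: suppress the header line; -P: parsable, pipe-delimited output
    return ["sacct", "-X", "-n", "-P", "-j", str(job_id), "--format=State"]


def parse_state(output):
    """Return the first state sacct printed, or None if there was no output."""
    for line in output.splitlines():
        line = line.strip()
        if line:
            # States such as "CANCELLED by 1234" carry a suffix; keep the first word.
            return line.split()[0]
    return None


def job_state(job_id):
    """Ask sacct (not squeue) for the job's state; requires sacct on PATH."""
    result = subprocess.run(
        sacct_state_command(job_id),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_state(result.stdout)
```

Calling `job_state(12345)` in a polling loop, with a sensible sleep between calls, keeps those status checks away from slurmctld.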

All the best,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
