We've hit this several times before. The tricks we've used to deal with it are:

1. Being on the latest release: A lot of work has gone into improving RPC throughput, so if you aren't running the latest 20.11 release I highly recommend upgrading. 20.02 was also fairly good in this regard.

2. max_rpc_cnt/defer: I recommend using either of these SchedulerParameters settings, as they give the scheduler more time to breathe between RPC bursts (see the slurm.conf sketch after this list).

3. Make sure your MySQL settings let the database be fully cached in memory rather than hitting disk. I also recommend running the database on the same server as your slurmctld; we've found that this can improve throughput (a my.cnf sketch follows this list).

4. We put a caching version of squeue in place, which gives users almost-live data rather than live data. This additional buffer layer helps cut down traffic to slurmctld. It's something we rolled in-house, backed by a database that updates every 30 seconds (a rough sketch of the idea follows this list).

5. Encourage users to submit jobs that run for more than 10 minutes and to use job arrays instead of looping over sbatch; this reduces thrashing (see the example after this list).
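
For #2, a minimal slurm.conf sketch; the values are only illustrative and the max_rpc_cnt threshold should be tuned for your site:

    # slurm.conf -- throttle scheduling while slurmctld is busy with RPCs
    # defer:           skip the per-job scheduling attempt at submit time
    # max_rpc_cnt=150: hold off scheduling while slurmctld has 150 or more
    #                  active RPC threads
    SchedulerParameters=defer,max_rpc_cnt=150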
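
For #3, the usual knobs are the InnoDB settings that the slurmdbd/accounting documentation points at; the numbers below are illustrative and should be sized so the whole database fits in the buffer pool:

    # my.cnf (illustrative values -- size to your own database)
    [mysqld]
    innodb_buffer_pool_size=8G    # large enough to hold slurm_acct_db in RAM
    innodb_log_file_size=64M
    innodb_lock_wait_timeout=900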
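
For #4, our in-house tool is backed by a database, but a stripped-down, file-based sketch of the same idea looks roughly like this (the path and script name are made up):

    #!/bin/bash
    # squeue-cache.sh -- refresh a shared squeue snapshot every 30 seconds.
    # Run one copy of this on a login node and point users at the wrapper
    # below instead of the real squeue binary.
    CACHE=/var/cache/slurm/squeue.out
    while true; do
        # One real RPC to slurmctld per interval, shared by every reader.
        squeue --all > "${CACHE}.tmp" && mv "${CACHE}.tmp" "$CACHE"
        sleep 30
    done

The wrapper that users actually call is then just:

    #!/bin/bash
    # squeue (wrapper) -- serve the cached snapshot, never hit slurmctld.
    cat /var/cache/slurm/squeue.out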
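
For #5, the difference in RPC load is essentially this (job.sh is a placeholder for the user's batch script):

    # 1000 sbatch calls -> 1000 submit RPCs; avoid this:
    #   for i in $(seq 1 1000); do sbatch job.sh "$i"; done
    # One array submission -> a single submit RPC for 1000 tasks; the script
    # can read $SLURM_ARRAY_TASK_ID to pick its input:
    sbatch --array=1-1000 job.sh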

Those are my recommendations for how to deal with this.

-Paul Edmon-

On 2/9/2021 7:59 PM, Kota Tsuyuzaki wrote:
Hello guys,

In our cluster, a new member sometimes accidentally generates too many Slurm RPC calls (sbatch, sacct, etc.), and then slurmctld,
slurmdbd, and mysql may become overloaded.
To prevent such a situation, I'm looking for something like a per-user RPC rate limit.
Does Slurm support such a rate-limit feature?
If not, is there a way to conserve Slurm server-side resources?

Best,
Kota

--------------------------------------------
露崎 浩太 (Kota Tsuyuzaki)
kota.tsuyuzaki...@hco.ntt.co.jp
NTT Software Innovation Center
Distributed Processing Platform Technology Project
0422-59-2837
---------------------------------------------




