‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐

On Wednesday, January 12, 2022 at 18:45, John R Anderson <j...@unr.edu> wrote:

> hello, a user has requested that we set MaxStepCount to "unlimited" or 
> 16million to accommodate some of their desired workflows. i searched around 
> for details about this parameter & don't see a lot, and i reviewed
> https://bugs.schedmd.com/show_bug.cgi?id=5722
>
> any thoughts on this? can this successfully be applied to a partition or 
> individual nodes only? i wonder about log files exploding or worse...

I think one bottleneck here could be accounting and SlurmDBD, if you are using
it. Each step is one record in the step table of the SQL database. If you end up
with hundreds of millions of records in that table, you might run into issues
with, e.g., archiving or sreport. Keep in mind that Slurm major version upgrades
may come with database schema changes, and converting a table of this order of
magnitude could take a long time (several hours, for instance).
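
To get a feel for the orders of magnitude, here is a rough back-of-the-envelope
sketch in Python. It only multiplies out the "one step = one record" rule. All
the numbers (jobs per day, steps per job, retention period, conversion rate)
are hypothetical placeholders, not measurements from any real cluster, so
substitute your own figures.

```python
def estimate_step_rows(jobs_per_day, steps_per_job, retention_days):
    """Estimate rows in the step table, assuming one row per job step."""
    return jobs_per_day * steps_per_job * retention_days


def estimate_conversion_hours(rows, rows_per_second):
    """Estimate the duration of a schema conversion at an assumed fixed rate."""
    return rows / rows_per_second / 3600


# Hypothetical workload: 50 jobs per day, 40,000 steps each, kept for 180 days.
rows = estimate_step_rows(jobs_per_day=50, steps_per_job=40_000, retention_days=180)
print(f"estimated step records: {rows:,}")

# Hypothetical conversion rate of 20,000 rows per second during an upgrade.
hours = estimate_conversion_hours(rows, rows_per_second=20_000)
print(f"estimated conversion time: {hours:.1f} hours")
```

With these placeholder values the estimate already lands in the hundreds of
millions of records, which is why the purge and archive settings in SlurmDBD
deserve a look before raising the limit.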

Given the total number of steps, I suspect this user may also generate a high
throughput of steps. At some point, slurmctld might need some specific tuning
to handle it gracefully [1].

[1] https://slurm.schedmd.com/high_throughput.html

--
Rémi Palancher
Rackslab: Open Source Solutions for HPC Operations
https://rackslab.io
