Hi,
Can someone help me understand what this error is?
select/cons_res: node cn95 memory is under-allocated (125000-135000) for
JobId=23544043
We get a lot of these from time to time and I don't understand what it's about.
Looking at the code, it doesn't make sense for this to be happening on ru
Hi Brian,
Thanks a lot for your recommendations.
I'll do my best to address your three points inline. I hope I've
understood you correctly; please correct me if I've misunderstood parts.
"1) If you can, either use xargs or parallel to do the forking so you can
limit the number of simultaneous s
Here is where you may want to look into slurmdbd and sacct
Then you can create a qos that has MaxJobsPerUser to limit the total
number running on a per-user basis:
https://slurm.schedmd.com/resource_limits.html
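As a rough sketch (the QOS name "throttled", the limit of 10, and the user name
"alice" are just placeholders), the setup looks something like:
$ sacctmgr add qos throttled                                  # create the QOS
$ sacctmgr modify qos throttled set MaxJobsPerUser=10         # cap running jobs per user
$ sacctmgr modify user where name=alice set qos+=throttled    # give a user access to it
$ sbatch --qos=throttled job.sh                               # submit against that QOS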
Brian Andrus
On 8/27/2019 9:38 AM, Guillaume Perrault Archambault wrote:
Hi Paul,
Just a couple comments from experience in general:
1) If you can, either use xargs or parallel to do the forking so you can
limit the number of simultaneous submissions
2) I have yet to see a case where it is a good idea to submit many separate
jobs when a job array would work.
If you can prep
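To illustrate point 1 with a rough sketch (params.txt and job.sh are placeholder
names), something like this keeps at most four sbatch processes going at once:
$ cat params.txt | xargs -n 1 -P 4 sbatch job.sh    # one submission per entry, max 4 concurrent
GNU parallel with -j 4 does the same thing; the point is just to throttle the
forking rather than launching every sbatch at once.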
Hi Paul,
Your comment confirms my worst fear, that I should either implement job
arrays or stick to a sequential for loop.
My problem with job arrays is that, as far as I understand them, they
cannot be used with singleton to set a max job limit.
I use singleton to limit the number of jobs a use
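For reference, the singleton pattern I mean looks roughly like this (the job
names and script are placeholders); Slurm runs at most one job per
job-name/user pair, so cycling through N names caps me at N running jobs:
$ sbatch --job-name=slot1 --dependency=singleton job.sh
$ sbatch --job-name=slot2 --dependency=singleton job.sh
$ sbatch --job-name=slot1 --dependency=singleton job.sh   # waits until the first slot1 job finishes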
Thanks, Ole, for giving so much thought to my question. I'll pass along
these suggestions. Unfortunately, as a user, there's not a whole lot I can do
about the choice of hardware.
Thanks for the link to the guide, I'll have a look at it. Even as a user
it's helpful to be well informed on the admin
At least for our cluster, we generally recommend that if you are
submitting large numbers of jobs you either use a job array or just
use a for loop over the jobs you want to submit. A fork bomb is definitely
not recommended. For highest-throughput submission, a job array is your
best bet as in on
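As a rough sketch (the array range, the throttle of 50, and the script name are
arbitrary), a throttled array submission looks like:
$ sbatch --array=1-1000%50 job.sh    # 1000 tasks, at most 50 running at once
and inside job.sh each task picks up its index from $SLURM_ARRAY_TASK_ID.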
Hi all,
I'm still puzzled by the expected behaviour of the following:
$ sbatch --hold fakejob.sh
Submitted batch job 25909273
$ sbatch --hold fakejob.sh
Submitted batch job 25909274
$ sbatch --hold fakejob.sh
Submitted batch job 25909275
$ scontrol update jobid=25909273 Dependency=singleton
$ scon
Hi Guillaume,
The performance of the slurmctld server depends strongly on the server
hardware on which it is running! This should be taken into account when
considering your question.
SchedMD recommends that the slurmctld server should have only a few, but
very fast CPU cores, in order to e
Hi Paul,
Thanks a lot for your suggestion.
The cluster I'm using has thousands of users, so I'm doubtful the admins
will change this setting just for me. But I'll mention it to the support
team I'm working with.
I was hoping more for something that can be done on the user end.
Is there some way