Tom,
I don't want to be 'that guy', but it sounds like the root cause of this
problem is the programs themselves. A well-written parallel program
should balance the workload and data pretty evenly across the nodes. Is
this software written by your own researchers, open-source, or a
commercial program? In my opinion, your efforts would be better spent
fixing the program(s), if possible, than finding a scheduler with the
feature you request, which I don't think exists.
If you can't fix the software, I think you're out of luck.
I was going to suggest requesting exclusive use of nodes (whole-node
assignment) as the easiest solution. What is the basis for the resistance?
Prentice
On 07/30/2015 11:34 AM, Tom Harvill wrote:
Hi,
We run SLURM with cgroups for memory containment of jobs. When users
request resources on our cluster many times they will specify the number
of (MPI) tasks and memory per task. The reality of much of the software
that runs is that most of the memory is used by MPI rank 0 and much less
on slave processes. This is wasteful and sometimes causes bad outcomes
(OOMs and worse) during job runs.
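To put rough numbers on that waste, here is a minimal sketch in Python. The task count and per-rank memory figures are made up for illustration and do not come from any real job. The only assumption about the scheduler is the one described above: a single per-task memory request applies to every task, so it has to be sized for rank 0.

```python
def uniform_request_waste(rank0_mb, other_mb, ntasks):
    """Return (requested_mb, used_mb, wasted_mb) for a job in which every
    task must request enough memory to cover rank 0's peak usage."""
    # With one per-task limit, each task has to ask for rank 0's peak.
    requested = rank0_mb * ntasks
    # Actual usage: rank 0 at its peak, the remaining ranks at their lower level.
    used = rank0_mb + other_mb * (ntasks - 1)
    return requested, used, requested - used


# Hypothetical job: 16 tasks, rank 0 peaks at 8000 MB, the others at 1000 MB.
requested, used, wasted = uniform_request_waste(8000, 1000, 16)
print(f"requested={requested} MB used={used} MB wasted={wasted} MB")
```

With these illustrative figures, most of the memory reserved for the job is never touched, which is the imbalance a per-process memory request would be meant to remove.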
AFAIK SLURM is not able to allow users to request a different amount of
memory for different processes in their MPI pool. We used to run
Maui/Torque and I'm fairly certain that feature is not present in that
scheduler either.
Does anyone know if any scheduler allows the user to request different
amounts of memory per process? We know we can move to whole-node
assignment to remedy this problem but there is resistance to that...
Thank you!
Tom
Tom Harvill
Holland Computing Center
hcc.unl.edu
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf