Considering how widely WRF is used, I'm surprised it has this problem.
Unfortunately, I can't offer you any advice. I've installed it once or
twice for others to use, but that's the limit of my experience with it.
It sounds like your users are being penny-wise and pound-foolish. If a
non-exclusive job starts sooner but eventually fails and has to be run
again, is it really a faster time to solution than the job that has to
wait a little longer in the queue, but succeeds the first time? I assume
you're using a backfill scheduler and require wallclock times with
submissions. If so, do you know how accurate your users' wallclock estimates
are? Inaccurate wallclock estimates could delay job starts in exclusive mode, too.
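One quick way to gauge that, assuming SLURM accounting is enabled on your
cluster, is to compare requested and actual walltimes with sacct, something
like:

    sacct -X --starttime 2015-07-01 --state COMPLETED \
          --format=User,JobID,Timelimit,Elapsed

If Elapsed is routinely a small fraction of Timelimit, the backfill scheduler
is planning around padded estimates and start times will suffer whether or
not the jobs are exclusive.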
Sounds like you might need to 'educate' your users a bit.
Prentice
On 07/30/2015 02:51 PM, Tom Harvill wrote:
Hi Prentice,
Thank you for your reply. Yes, it's 'bad' code. It's WRF mostly. If you have
suggestions for that app, I'm all ears. We don't control the code base, and
we're also not allowed to update it except between projects, which is very
infrequent.
It would be ideal if we could control memory allocations for individual
processes within a job, but I don't expect that's possible. I wanted to reach
out to this list of experts in case we might be missing something.
The resistance comes from the increased wait times caused by staggered serial
jobs that prevent a node from being allocated exclusively. Yes, the users
would probably get better aggregate turnaround time if they waited for node
exclusivity...
...Tom
On 7/30/2015 1:37 PM, Prentice Bisbal wrote:
Tom,
I don't want to be 'that guy', but it sounds like the root cause of
this problem is the programs themselves. A well-written parallel
program should balance the workload and data pretty evenly across the
nodes. Is this software written by your own researchers, open-source,
or a commercial program? In my opinion, your efforts would be better
spent fixing the program(s), if possible, than finding a scheduler
with the feature you request, which I don't think exists.
If you can't fix the software, I think you're out of luck.
I was going to suggest requesting exclusive use of nodes (whole-node
assignment) as the easiest solution. What is the basis for the resistance?
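For what it's worth, a whole-node request is only a couple of extra sbatch
directives; a rough sketch (the binary name is just a placeholder):

    #!/bin/bash
    #SBATCH --ntasks=16
    #SBATCH --exclusive    # no other jobs share the allocated nodes
    #SBATCH --mem=0        # asks for all memory on each node (check your sbatch man page)
    #SBATCH --time=12:00:00

    srun ./wrf.exe

Rank 0 can then grow into whatever memory its node has, instead of hitting a
cgroup limit sized for the average task.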
Prentice
On 07/30/2015 11:34 AM, Tom Harvill wrote:
Hi,
We run SLURM with cgroups for memory containment of jobs. When users request
resources on our cluster, they often specify the number of (MPI) tasks and
the memory per task. The reality for much of the software that runs here is
that most of the memory is used by MPI rank 0, with much less used by the
slave processes. This is wasteful and sometimes causes bad outcomes (OOMs and
worse) during job runs.
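For context, a typical request here looks something like the sketch below
(the binary name is just a stand-in for WRF); the point is that a single
memory figure is applied to every task, with no way to give rank 0 a larger
share:

    #!/bin/bash
    #SBATCH --ntasks=16          # MPI ranks
    #SBATCH --mem-per-cpu=4G     # the same limit for every task, rank 0 included
    #SBATCH --time=12:00:00

    srun ./wrf.exe

To keep rank 0 from being OOM-killed we have to raise --mem-per-cpu for
everyone, which is exactly the waste described above.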
AFAIK SLURM does not allow users to request a different amount of memory for
different processes in their MPI pool. We used to run Maui/Torque and I'm
fairly certain that feature is not present in that scheduler either.
Does anyone know of a scheduler that allows the user to request different
amounts of memory per process? We know we can move to whole-node assignment
to remedy this problem, but there is resistance to that...
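(For anyone who wants to see the imbalance in accounting data: assuming a
jobacct_gather plugin is collecting per-task statistics, sacct can report
which task hit the peak RSS, e.g.

    sacct -j <jobid> --format=JobID,MaxRSS,MaxRSSTask,MaxRSSNode

and MaxRSSTask should come back as task 0 if the usage is as lopsided as
described above.)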
Thank you!
Tom
Tom Harvill
Holland Computing Center
hcc.unl.edu
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf