One thing that seems concerning to me is that you may start a job on a
node before a currently running job has 'expanded' as much as it will.
If there is 128G on the node and the current job is using 64G but will
eventually use 112G, your approach could start another similar job and
they would both start swapping.
We had always pushed the users to know what they need before they submit
a job. They can ask for too much and then go down from there, but it is
really their responsibility to know what their program will do. You are
giving them the keys to a Tesla and they want to blame you if they put
the pedal to the metal and crash. Learn the tools before you use them.
Brian Andrus
On 5/29/2018 6:56 AM, PULIDO, Alexandre wrote:
Thanks for your inputs, the automatic reporting is definitely a great
idea and seems easy to implement in Slurm. At our site we have a web
portal developed internally where users can see in real time
everything that is happening on the cluster, and every metric of their
own job. In particular, there is a color code showing the
under/overestimation of each job's memory allocation.
We have constraints: we cannot afford to lose time killing jobs, or to
lose performance if a 16G job is allocated to a node where only 4G is
left.
In PBS, taking the actual free memory into account as a resource for
allocation is a great way to handle this. It seems a shame not to use
Slurm’s allocation algorithms and instead develop another, hacky one
with “numerical features” per node.
I’ll admit I’m not comfortable enough editing the cons_res plugin
source code, but there doesn’t seem to be another way around this
need.
Regards,
Alexandre
*From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On
behalf of* John Hearns
*Sent:* Tuesday, 29 May 2018 13:16
*To:* Slurm User Community List
*Subject:* Re: [slurm-users] Using free memory available when
allocating a node to a job
Alexandre, you have made a very good point here: "Oftentimes users
only input 1G as they really have no idea of the memory requirements."
At my last job we introduced cgroups (this was in PBSPro). We had to
enforce a minimum request for memory.
Users then asked us how much memory their jobs used, so that they
could request an amount of memory next time which would let the job
run to completion.
We were giving users that information manually.
I realise that the tools are there for users to get information on
memory usage after a job, but I really do not expect users to have to
figure this out.
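For reference, the sort of post-job lookup I mean is along these lines (the job ID is just an example, and this assumes Slurm accounting is enabled at your site):

```shell
# Peak resident memory (MaxRSS) versus what was requested (ReqMem)
# for a finished job; 12345 is a hypothetical job ID.
sacct -j 12345 --format=JobID,JobName,MaxRSS,ReqMem,Elapsed

# If the contributed "seff" script is installed, it summarises
# the same accounting data as an efficiency report.
seff 12345
```

But my point stands: asking every user to run this by hand does not scale, which is why the automatic reporting idea appeals to me.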
What do other sites do in this case?
On 29 May 2018 at 12:57, PULIDO, Alexandre
<alexandre.pulido@ariane.group <mailto:alexandre.pulido@ariane.group>>
wrote:
Hello John, this behavior is needed because the memory usage of the
codes executed on the nodes is particularly hard to predict. Usually,
when the request is exceeded, actual usage is between 1.1 and 1.3
times what was expected. Sometimes much larger.
A) Indeed there is a partition running only exclusive jobs, but a
large number of nodes are also needed for non-exclusive allocation.
That’s why the exact amount of available memory is required in this
configuration. Tasks are not killed if they take more than allocated.
B) Yes, currently cgroup is configured and working as expected (I
believe), but as I said, tasks need to be able to grow larger.
Oftentimes users only input 1G as they really have no idea of the
memory requirements, and with the high demand of HPC time a lower
memory requirement is set so the job will start.
So a job should not be started on a node where another job is filling
up the RAM; it would start on another node instead.
Would this behavior cause problems in the scheduling/allocation
algorithms ? The way I see it the actual free memory would be just
another consumable resource.
But the only way I can see this working is by tweaking the plugin,
correct ?
Thank you for your inputs.
*From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com
<mailto:slurm-users-boun...@lists.schedmd.com>] *On behalf of* John
Hearns
*Sent:* Tuesday, 29 May 2018 12:39
*To:* Slurm User Community List
*Subject:* Re: [slurm-users] Using free memory available when
allocating a node to a job
Also regarding memory, there are system tunings you can set for the
behaviour of the Out-Of-Memory (OOM) killer and also for VM
overcommit.
I have seen the VM overcommit parameters discussed elsewhere, and for
HPC people generally advise disabling overcommit:
https://www.suse.com/support/kb/doc/?id=7002775
This of course is very dependent on your environment and applications.
Could you please say what problems you are having with memory?
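For the archives, the overcommit knobs I am referring to are sysctls along these lines (a sketch only; the ratio value is illustrative, so check the SUSE article above before applying anything):

```shell
# Strict overcommit accounting (mode 2): the kernel refuses
# allocations beyond swap + overcommit_ratio% of RAM, instead of
# the default heuristic (mode 0) that lets the OOM killer clean up.
sysctl vm.overcommit_memory=2
sysctl vm.overcommit_ratio=80

# To make it persistent, drop the settings into a sysctl.d file, e.g.:
# echo 'vm.overcommit_memory = 2' >> /etc/sysctl.d/90-hpc.conf
```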
On 29 May 2018 at 12:26, John Hearns <hear...@googlemail.com
<mailto:hear...@googlemail.com>> wrote:
Alexandre, it would be helpful if you could say why this behaviour is
desirable.
For instance, do you have codes which need a large amount of memory,
and are your users seeing these codes crash because other codes
running on the same nodes are using that memory?
I have two thoughts:
A) Enable job-exclusive scheduling, i.e. run one job per compute node.
Then that job has all the memory.
This is a very good way to run HPC in my experience. Yes, I know it is
inefficient if there are lots of single-core jobs, so this depends on
what your mix of jobs is.
B) Have you considered implementing cgroups? Then each job will be
allocated memory and cpu cores.
Jobs will not be able to grow larger than their allocated cgroup limits.
I would really ask you to consider cgroups.
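For completeness, a minimal sketch of the relevant configuration (parameter values are illustrative; see the slurm.conf and cgroup.conf man pages for your version):

```
# slurm.conf: contain tasks with cgroups and treat memory as a
# consumable resource alongside cores
TaskPlugin=task/cgroup
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# cgroup.conf: confine each job to the memory it requested
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedSwapSpace=0
```

With ConstrainRAMSpace a job simply cannot grow past its request, which sidesteps the "how much free memory is really left" question.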
On 29 May 2018 at 11:34, PULIDO, Alexandre
<alexandre.pulido@ariane.group <mailto:alexandre.pulido@ariane.group>>
wrote:
Hi,
in the cluster where I'm deploying Slurm, job allocation has to be
based on the actual free memory available on the node, not just the
amount allocated by Slurm. This is non-negotiable, and I understand
that it's not how Slurm is designed to work, but I'm trying anyway.
Among the solutions that I'm envisaging:
1) Periodically create and update a numerical node feature, encoded as
a string with a special character separating off the wanted value
(memfree_2048). This definitely seems like a mess to implement and too
hacky, but is there an equivalent to PBS' numerical complexes and
sensors in Slurm?
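As a sketch of what I mean for option 1 (everything here is hypothetical; the scontrol call is commented out since it only makes sense on a live node), a cron job on each node could do something like:

```shell
#!/bin/bash
# Derive a coarse "memfree_NNNN" feature string from the node's
# MemAvailable, in MB.
memfree_mb=$(awk '/MemAvailable:/ {print int($2/1024)}' /proc/meminfo)

# Round down to 1024 MB steps so the set of feature strings stays small.
feature="memfree_$(( memfree_mb / 1024 * 1024 ))"
echo "$feature"

# On a real node this would then be pushed to the controller, e.g.:
# scontrol update NodeName="$(hostname -s)" ActiveFeatures="$feature"
```

Jobs would then have to select nodes with --constraint, but since features only match exact strings, a job wanting "at least 16G free" would need to OR together every acceptable value, which is part of why this feels so hacky.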
2) Modify the select/cons_res plugin to compare against the actual
free memory instead of the allocated memory. Is it as simple as
editing the "_add_job_to_res" function
(https://github.com/SchedMD/slurm/blob/master/src/plugins/select/cons_res/select_cons_res.c#L816)
and using the real remaining memory? I don't want to break anything
else, so that's my main question here; please guide me towards a
solution or share any other thoughts on its feasibility.
Thanks a lot in advance!
Best regards,
*Alexandre PULIDO*
ArianeGroup
This email (including any attachments) may contain confidential or
proprietary and/or privileged information or information otherwise
protected from disclosure or may be subject to export control laws and
regulations. If you are not the intended recipient, please notify the
sender immediately, do not reproduce this message or any attachments
and do not use it for any purpose or disclose its content to any
person, but delete this message and any attachments from your system.
Unauthorized export or re-export is prohibited. ArianeGroup SAS
disclaims any and all liability if this email transmission was virus
corrupted, altered or falsified.
ArianeGroup SAS (519 032 247 RCS PARIS) - Capital social : 265 904 408
EUR - Siège social : Tour Cristal, 7-11 Quai André Citroën, 75015
Paris - TVA FR 82 519 032 247 - APE/NAF 3030Z