Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

2018-04-23 Thread Kevin Manalo
Shawn, just to give you a compare and contrast, here are our related entries.

slurm.conf:

  JobAcctGatherType=jobacct_gather/linux   # will migrate to cgroup eventually
  JobAcctGatherFrequency=30
  ProctrackType=proctrack/cgroup
  TaskPlugin=task/affinity,task/cgroup

cgroup_allowed_devices_file.conf:

  /dev
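For context, a setup like that is usually paired with a cgroup.conf along these lines; the values below are a minimal illustrative sketch of device constraint, not Kevin's actual file, and the AllowedDevicesFile path is an assumption:

  # cgroup.conf (illustrative; path below is an assumed location)
  CgroupAutomount=yes
  ConstrainCores=yes
  ConstrainRAMSpace=yes
  ConstrainDevices=yes
  AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf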

Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

2018-04-23 Thread Shawn Bobbin
Hi, I attached our cgroup.conf and gres.conf. As for the cgroup_allowed_devices.conf file, I have it stubbed but empty. In 17.02 Slurm started fine without this file (as far as I remember), and it being empty doesn't appear to actually impact anything… device availability remains the same.
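For anyone following along, the example allowed-devices file in the Slurm cgroup documentation looks roughly like the following; the device paths are the documentation's illustration, not Shawn's attachment, and whether an empty file behaves like listing /dev wholesale (as in Kevin's setup) is exactly the open question here:

  # cgroup_allowed_devices_file.conf (documentation-style example, not the site's real file)
  /dev/null
  /dev/urandom
  /dev/zero
  /dev/sda*
  /dev/cpu/*/*
  /dev/pts/*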

Re: [slurm-users] Job still running after process completed

2018-04-23 Thread John Hearns
*Caedite eos. Novit enim Dominus qui sunt eius* https://en.wikipedia.org/wiki/Caedite_eos._Novit_enim_Dominus_qui_sunt_eius. I have been wanting to use that line in the context of batch systems and users for ages. At least now I can make it a play on killing processes. Rather than being put on a…

Re: [slurm-users] Job still running after process completed

2018-04-23 Thread Chris Samuel
On Monday, 23 April 2018 11:58:56 PM AEST Paul Edmon wrote: > I would recommend putting a clean up process in your epilog script. Instead of that I'd recommend using cgroups to constrain processes to the resources they have requested; it has the useful side effect of being able to track all child processes…
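A minimal sketch of what "using cgroups to constrain processes" typically means in the configuration, assuming the cgroup plugins are installed on the compute nodes (the parameter values are illustrative, not Chris's site config):

  # slurm.conf
  ProctrackType=proctrack/cgroup
  TaskPlugin=task/affinity,task/cgroup

  # cgroup.conf
  ConstrainCores=yes
  ConstrainRAMSpace=yes

With proctrack/cgroup every process a job step spawns lands in the job's cgroup, so stray children can be found and killed when the step ends rather than being hunted down in an epilog.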

Re: [slurm-users] Job still running after process completed

2018-04-23 Thread Paul Edmon
I would recommend putting a clean up process in your epilog script. We have a check here that sees if the job completed, and if so it then terminates all the user's processes with kill -9 to clean up any residuals. If that fails it closes off the node so we can reboot it. -Paul Edmon-
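A rough sketch of that kind of epilog cleanup, assuming SLURM_JOB_USER and SLURM_JOB_ID are available in the epilog environment (slurmd sets them) and that the drain reason string is site-specific; this illustrates the idea, it is not Paul's actual script:

  #!/bin/bash
  # Epilog sketch: kill anything the job's owner left behind, drain the node if that fails.
  if [ "$SLURM_JOB_USER" != "root" ]; then
      # Kill all remaining processes owned by the job's user on this node.
      pkill -9 -u "$SLURM_JOB_USER"
      sleep 2
      if pgrep -u "$SLURM_JOB_USER" > /dev/null; then
          # Cleanup failed: take the node out of service so it can be rebooted.
          scontrol update NodeName="$(hostname -s)" State=DRAIN \
              Reason="epilog: stray processes after job $SLURM_JOB_ID"
      fi
  fi
  exit 0

Note that blindly killing everything the user owns is only safe when nodes are allocated exclusively or the user can have at most one job per node; a check that the job actually completed, as Paul describes, belongs before the pkill.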

Re: [slurm-users] Job still running after process completed

2018-04-23 Thread John Hearns
Nicolo, I cannot say what your problem is. However, in the past with problems like this I would a) look at ps -eaf --forest and try to see what the parent processes of these job processes are. Clearly, if the parent PID is 1 then --forest is not much help. But the --forest option is my 'goto' option…
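For reference, the kind of inspection John is describing (the user name below is a placeholder):

  # Full process tree on the node
  ps -eaf --forest

  # Narrow it to one user's processes, with parent PIDs and elapsed time visible
  ps -u someuser -o pid,ppid,stat,etime,cmd --forest

If the leftover processes have been reparented to PID 1, the tree view no longer shows which job they came from, which is where cgroup-based process tracking (mentioned elsewhere in this thread) helps.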

Re: [slurm-users] Include some cores of the head node to a partition

2018-04-23 Thread Chris Samuel
On Sunday, 22 April 2018 12:55:46 PM AEST Mahmood Naderan wrote: > I think that will limit other nodes to 20 too. Isn't that? Currently the computes have 32 cores per node and I want all 32 cores. The head node also has 32 cores but I want to include only 20 cores. Apologies, I misunderstood what…
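One common way to expose only 20 of the head node's 32 cores to jobs while leaving the compute nodes untouched is core specialization on that node alone; a sketch, where the node names, partition name, and memory figure are placeholders rather than anything from this thread:

  # slurm.conf (illustrative): reserve 12 cores on the head node for local work,
  # so Slurm schedules jobs on the remaining 20; compute nodes keep all 32.
  NodeName=head CPUs=32 CoreSpecCount=12 RealMemory=64000 State=UNKNOWN
  NodeName=compute[01-04] CPUs=32 RealMemory=64000 State=UNKNOWN
  PartitionName=main Nodes=head,compute[01-04] Default=YES State=UP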

[slurm-users] Job still running after process completed

2018-04-23 Thread Nicolò Parmiggiani
Hi, I have a job that keeps running even though the internal process is finished. What could be the problem? Thank you.

Re: [slurm-users] Slurm overhead

2018-04-23 Thread Chris Samuel
On Sunday, 22 April 2018 4:06:56 PM AEST Mahmood Naderan wrote: > I ran some other tests and got nearly the same results. That 4 minutes in my previous post means about 50% overhead. So, 24000 minutes on a direct run is about 35000 minutes via Slurm. That sounds like there's really something…
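One way to pin down where overhead of that size comes from is to compare the application's own wall-clock time with what Slurm accounts for the job; a sketch, where the binary name, input file, and job ID are placeholders:

  # Direct run, outside Slurm
  time ./my_app input.dat

  # Same run under Slurm, then check what Slurm recorded for it
  sbatch --wrap="./my_app input.dat"
  sacct -j 12345 -o JobID,Elapsed,TotalCPU,NCPUS,State

If Elapsed and the direct wall-clock time agree but the end-to-end turnaround does not, the difference is queueing or prolog/epilog time rather than runtime overhead.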