Shawn,
Just to give you a compare and contrast:
We have these related entries in slurm.conf:
JobAcctGatherType=jobacct_gather/linux # will migrate to cgroup eventually
JobAcctGatherFrequency=30
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
cgroup_allowed_devices_file.conf:
/dev
Hi, I attached our cgroup.conf and gres.conf. As for the cgroup_allowed_devices.conf file, I have it stubbed out but empty. In 17.02 Slurm started fine without this file (as far as I remember), and its being empty doesn’t appear to actually impact anything… device availability remains the same.
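(The attached cgroup.conf and gres.conf aren't reproduced in the digest. Purely as an illustration of the kind of file being discussed, a cgroup.conf for a setup like the one above might look as follows; every value here is an assumption, not the poster's actual configuration.)
cgroup.conf (illustrative):
CgroupAutomount=yes
ConstrainCores=yes           # keep tasks on the cores they were allocated
ConstrainRAMSpace=yes        # enforce the requested memory limit
ConstrainDevices=yes         # uses the allowed-devices file referenced below
AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf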
*Caedite eos. Novit enim Dominus qui sunt eius* ("Kill them all; the Lord knows His own")
https://en.wikipedia.org/wiki/Caedite_eos._Novit_enim_Dominus_qui_sunt_eius.
I have been wanting to use that line in the context of batch systems and
users for ages.
At least now I can make it a play on killing processes. Rather than being
put on a
On Monday, 23 April 2018 11:58:56 PM AEST Paul Edmon wrote:
> I would recommend putting a clean up process in your epilog script.
Instead of that I'd recommend using cgroups to constrain processes to the
resources they have requested; it has the useful side effect of being able to
track all child processes of a job.
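As a quick, hedged illustration of what proctrack/cgroup gives you (the job ID and user name are placeholders, and the cgroup mount point and layout vary between distros and cgroup versions):
# PIDs Slurm is tracking for job 12345 (run on the node the job is on)
scontrol listpids 12345
# the same information read straight from the job's freezer cgroup (cgroup v1 layout)
cat /sys/fs/cgroup/freezer/slurm/uid_$(id -u someuser)/job_12345/step_*/cgroup.procs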
I would recommend putting a clean up process in your epilog script. We
have a check here that sees if the job completed; if so, it then
terminates all of the user's processes with kill -9 to clean up any
residuals. If that fails, it closes off the node so we can reboot it.
-Paul Edmon-
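A minimal sketch of the kind of epilog clean-up Paul describes, assuming slurmd runs the script as root with SLURM_JOB_USER, SLURM_JOB_ID and SLURMD_NODENAME in its environment; the signal handling, drain step and reason string are illustrative, not his actual script:
#!/bin/bash
# Epilog: kill anything the job's user left behind; drain the node if that fails.
# Note: this removes everything the user still owns on this node, which assumes
# whole-node scheduling or that this was the user's last job on the node.
if [ "$SLURM_JOB_USER" != "root" ]; then
    pkill -9 -u "$SLURM_JOB_USER"
    sleep 2
    if pgrep -u "$SLURM_JOB_USER" > /dev/null; then
        scontrol update NodeName="$SLURMD_NODENAME" State=DRAIN \
            Reason="epilog: unkillable processes after job $SLURM_JOB_ID"
    fi
fi
exit 0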
Nicolo, I cannot say what your problem is.
However, in the past with problems like this I would:
a) look at ps -eaf --forest and try to see what the parent processes of these job processes are.
Clearly if the parent PID is 1 then --forest is not much help, but the --forest option is my 'go to' option.
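For example (the user name is a placeholder):
# whole-node view of the process tree
ps -eaf --forest
# just one user's processes, with parent PIDs visible
ps -u someuser -o pid,ppid,stat,etime,cmd --forest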
On Sunday, 22 April 2018 12:55:46 PM AEST Mahmood Naderan wrote:
> I think that will limit other nodes to 20 too. Isn't that?
>
> Currently computes have 32 cores per node and I want all 32 cores. The head
> node also has 32 core but I want to include only 20 cores.
Apologies, I misunderstood what you meant.
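For what it's worth, one way to expose only 20 of the head node's 32 cores to Slurm is to reserve the rest with CoreSpecCount (or simply to advertise fewer CPUs on that node); the node names and counts below are illustrative, and this is a sketch rather than the advice that actually followed in the thread:
# slurm.conf (illustrative)
# Reserve 12 of the head node's 32 cores for the OS and local daemons,
# leaving 20 schedulable (enforcement depends on the task plugin configuration).
NodeName=headnode Sockets=2 CoresPerSocket=16 ThreadsPerCore=1 CoreSpecCount=12 State=UNKNOWN
# Compute nodes keep all 32 cores.
NodeName=compute[01-32] CPUs=32 State=UNKNOWN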
Hi,
I have a job that keeps running even though the internal process is
finished.
What could be the problem?
Thank you.
On Sunday, 22 April 2018 4:06:56 PM AEST Mahmood Naderan wrote:
> I ran some other tests and got nearly the same results. That 4
> minutes in my previous post means about 50% overhead. So, 24000
> minutes on direct run is about 35000 minutes via slurm.
That sounds like there's really something wrong there.
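As a quick check of the figures quoted above: 35000 / 24000 ≈ 1.46, i.e. roughly 46% overhead, which is consistent with the "about 50%" estimate.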