Dan, completely off topic here. May I ask what type of simulations you are running? You probably have a large investment of time in Trick already. However, as a fan of the Julia language, let me leave this link here: https://juliaobserver.com/packages/RigidBodyDynamics
On 5 March 2018 at 07:31, John Hearns <hear...@googlemail.com> wrote:
> I completely agree with what Chris says regarding cgroups. Implement them,
> and you will not regret it.
>
> I have worked with other simulation frameworks which work in a similar
> fashion to Trick, i.e. a master process which spawns off independent worker
> processes on compute nodes. I am thinking of an internal application we
> have and, if I may also say it, Matlab.
>
> In the Trick documentation:
> <https://github.com/nasa/trick/wiki/UserGuide-Monte-Carlo#notes> Notes
>
>    1. SSH <https://en.wikipedia.org/wiki/Secure_Shell> is used to launch
>       slaves across the network
>    2. Each slave machine will work in parallel with other slaves, greatly
>       reducing the computation time of a simulation
>
> However, I must say there must be plenty of folks at NASA who use this
> simulation framework on HPC clusters with batch systems.
> It would surprise me if there were no 'adaptation layers' available for
> Slurm, SGE, PBS etc.
> So in Slurm, you would do an sbatch which would reserve your worker nodes,
> then run a series of srun commands which run the worker processes.
>
> (I hope I have that the right way round - I seem to recall doing srun then
> a series of sbatches in the past.)
>
> But looking at the Trick wiki quickly, I am wrong. It does seem to work on
> the model of "get a list of hosts allocated by your batch system",
> i.e. SLURM_JOB_NODELIST, and then Trick will set up simulation queues which
> spawn off models using ssh.
> Looking at the Advanced Topics guide this does seem to be so:
> https://github.com/nasa/trick/blob/master/share/doc/trick/Trick_Advanced_Topics.pdf
> The model is that you allocate up to 16 remote worker hosts for a long
> time. Then various modelling tasks are started on those hosts via ssh.
> Trick expects those hosts to be available for more tasks during your
> simulation session.
> There is also discussion there about turning off irqbalance and cpuspeed,
> and disabling unnecessary system services.
>
> As someone who has spent endless oodles of hours either killing orphaned
> processes on nodes, or seeing rogue-process alarms, or running ps --forest
> to trace connections into batch job nodes which bypass the pbs/slurm
> daemons, I despair slightly...
> I am probably very wrong, and NASA have excellent Slurm integration.
>
> So I agree with Chris - implement cgroups, and try to make sure your ssh
> 'lands' in a cgroup.
> 'lscgroup' is a nice command to see which cgroups are active on a compute
> node.
> Also, run an interactive job, ssh into one of your allocated worker nodes,
> then cat /proc/self/cgroup to see which cgroups you have landed in.
>
> On 5 March 2018 at 02:20, Christopher Samuel <ch...@csamuel.org> wrote:
>
>> On 05/03/18 12:12, Dan Jordan wrote:
>>
>>> What is the /correct/ way to clean up processes across the nodes
>>> given to my program by SLURM_JOB_NODELIST?
>>
>> I'd strongly suggest using cgroups in your Slurm config to ensure that
>> processes are corralled and tracked correctly.
>>
>> You can use pam_slurm_adopt from the contrib directory to capture
>> inbound SSH sessions into a running job on the node (and deny access to
>> people who don't).
>>
>> Then Slurm should take care of everything for you without needing an
>> epilog.
>>
>> Hope this helps!
>> Chris
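
For what it is worth, here is a rough sketch of the sbatch-then-srun pattern
described above (untested; the job name, node count, time limit and worker
script name are just placeholders):

    #!/bin/bash
    #SBATCH --job-name=trick_mc
    #SBATCH --nodes=16
    #SBATCH --ntasks-per-node=1
    #SBATCH --time=04:00:00

    # Expand the compact nodelist into one hostname per line, for tools
    # (like Trick's ssh-based launcher) that want an explicit host list.
    scontrol show hostnames "$SLURM_JOB_NODELIST" > hosts.txt

    # Or, staying entirely inside Slurm: start one worker per allocated node.
    srun ./my_worker_model

If the framework spawns its work over ssh from that host list rather than via
srun, that is exactly where cgroups plus pam_slurm_adopt earn their keep,
since the ssh sessions then get pulled into the job rather than floating free.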
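
And since cgroups came up twice in the thread, this is roughly what that side
of the configuration can look like - a sketch only, as plugin names, paths and
defaults vary with your Slurm version and site setup:

    # slurm.conf - track and constrain job processes with cgroups
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup
    # Needed so pam_slurm_adopt has an "extern" step to adopt ssh sessions into
    PrologFlags=Contain

    # cgroup.conf
    ConstrainCores=yes
    ConstrainRAMSpace=yes

    # /etc/pam.d/sshd on the compute nodes - adopt inbound ssh into the
    # owning job, and deny logins from users with no job on that node
    account    required    pam_slurm_adopt.so

With that in place, anything left behind in the job's cgroup - including
adopted ssh sessions - is cleaned up when the job ends, which is really the
answer to Dan's original question, with no epilog pkill gymnastics required.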