Dan, thank you very much for a comprehensive and understandable reply.

On 5 March 2018 at 16:28, Dan Jordan <ddj...@gmail.com> wrote:
> John/Chris,
>
> Thanks for your advice. I'll need to do some reading on cgroups; I've
> never even been exposed to that concept. I don't even know if the SLURM
> setup I have access to has the cgroup or PAM plugins/modules
> enabled/available. Unfortunately I'm not involved in the administration
> of SLURM, I'm simply a user of a much larger system that's already
> established, with other users doing compute tasks completely separate
> from my use case. Therefore, I'm most interested in solutions that I can
> implement without sysadmin support on the SLURM side, which is why I
> started looking at the --epilog route.
>
> I have neither administrator access to SLURM nor the time to consider
> more complex approaches that might hack the Trick architecture.
> *Literally the only thing that isn't working for me right now is the
> cleanup mechanism*; everything else is working just fine. It's not as
> simple as killing all the simulation-spawned processes: the processes
> themselves create message queues for internal communication that live in
> /dev/mqueue/ on each node, and when the sim gets a kill -9 signal there
> is no internal cleanup, so those files linger on the filesystem
> indefinitely, causing issues in subsequent runs on those machines.
>
> From my understanding, there's already a "master" epilog script
> implemented in our system that kills all user processes after a user's
> job completes. They have set up our SLURM nodes to be "reserved" for the
> user requesting them, so their greedy cleanup script isn't a problem for
> other compute processes; the nodes are reserved for that single person.
> I might just ping the administrators and ask them to also add an
> 'rm /dev/mqueue/*' to that script; to me that seems like the fastest
> solution given what I know. I would prefer to keep that part in "user
> space" since it's very specific to my use case, but srun --epilog is not
> behaving as I would expect. Can y'all confirm that what I'm seeing is
> indeed what is expected to happen?
>
>   ssh:    ssh machine001
>   srun:   srun --nodes 3 --epilog cleanup.sh myProgram.exe
>   squeue: shows job 123 running on machine200, machine201, machine202
>   kill:   scancel 123
>   Result: myProgram.exe is terminated, cleanup.sh runs on machine001
>
> I was expecting cleanup.sh to run on one (or all) of the compute nodes
> (200-202), not on the machine I launched the srun command from (001).
>
> John -- Yes, we are heavily invested in the Trick framework and use its
> Monte-Carlo feature quite extensively. In the past we've used PBS to
> manage our compute nodes, but this is the first attempt to integrate
> Trick Monte-Carlo with SLURM. We do spacecraft simulation and analysis
> for various projects.
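As far as I can tell from the srun man page, what you are seeing is the
documented behaviour: --epilog is run by srun itself just after the job step
completes, so it executes on the node you launched from (machine001), not on
the compute nodes. The per-node Epilog is the one your administrators
configure in slurm.conf, and srun also has a --task-epilog option which, if I
read the page correctly, is run by slurmstepd on each compute node after each
task ends. That may be closer to what you want, though I have not tested how
it behaves under scancel.

For what it is worth, here is a minimal sketch of the sort of per-node cleanup
I have in mind. The name cleanup.sh is yours; the ownership test and the
assumption that your sim's queues are the only ones you own under /dev/mqueue
are mine, so adjust to taste:

    #!/bin/bash
    # cleanup.sh - remove leftover POSIX message queues owned by this user.
    # Assumes the sim's queues are the only entries this user owns under
    # /dev/mqueue; add a name prefix to the glob if that is not the case.
    shopt -s nullglob                # an empty glob expands to nothing
    for q in /dev/mqueue/*; do
        if [ -O "$q" ]; then         # only touch queues we own
            rm -f -- "$q"            # unlinks the queue
        fi
    done

If the administrators will not add something like that to their master epilog,
one user-space fallback might be to run it yourself against the same machines
after the sim finishes or is killed, e.g.

    srun --nodelist=machine[200-202] --ntasks-per-node=1 ./cleanup.sh

though whether you land back on exactly those nodes depends on how your
reservation is set up.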
> On Mon, Mar 5, 2018 at 12:36 AM, John Hearns <hear...@googlemail.com>
> wrote:
>
>> Dan, completely off topic here. May I ask what type of simulations you
>> are running? Clearly you probably have a large investment of time in
>> Trick. However, as a fan of the Julia language, let me leave this link
>> here: https://juliaobserver.com/packages/RigidBodyDynamics
>>
>> On 5 March 2018 at 07:31, John Hearns <hear...@googlemail.com> wrote:
>>
>>> I completely agree with what Chris says regarding cgroups. Implement
>>> them, and you will not regret it.
>>>
>>> I have worked with other simulation frameworks which work in a similar
>>> fashion to Trick, i.e. a master process which spawns off independent
>>> worker processes on compute nodes. I am thinking of an internal
>>> application we have and, dare I say it, Matlab.
>>>
>>> In the Trick documentation, under Notes
>>> (https://github.com/nasa/trick/wiki/UserGuide-Monte-Carlo#notes):
>>>
>>>    1. SSH (https://en.wikipedia.org/wiki/Secure_Shell) is used to
>>>       launch slaves across the network
>>>    2. Each slave machine will work in parallel with other slaves,
>>>       greatly reducing the computation time of a simulation
>>>
>>> However, I must say that there must be plenty of folks at NASA who use
>>> this simulation framework on HPC clusters with batch systems. It would
>>> surprise me if there were not 'adaptation layers' available for Slurm,
>>> SGE, PBS etc. So in Slurm, you would do an sbatch which would reserve
>>> your worker nodes, then run a series of sruns which run the worker
>>> processes.
>>>
>>> (I hope I have that the right way round - I seem to recall doing an
>>> srun then a series of sbatches in the past.)
>>>
>>> But looking at the Trick Wiki quickly, I am wrong. It does seem to work
>>> on the model of "get a list of hosts allocated by your batch system",
>>> i.e. the SLURM_JOB_NODELIST, and then Trick will set up simulation
>>> queues which spawn off models using ssh. Looking at the Advanced Topics
>>> guide, this does seem to be so:
>>> https://github.com/nasa/trick/blob/master/share/doc/trick/Trick_Advanced_Topics.pdf
>>> The model is that you allocate up to 16 remote worker hosts for a long
>>> time, then various modelling tasks are started on those hosts via ssh,
>>> and Trick expects those hosts to be available for more tasks during
>>> your simulation session. There is also discussion there about turning
>>> off irqbalance and cpuspeed, and disabling unnecessary system services.
>>>
>>> As someone who has spent endless oodles of hours either killing
>>> orphaned processes on nodes, or seeing rogue-process alarms, or running
>>> ps --forest to trace connections into batch job nodes which bypass the
>>> PBS/Slurm daemons, I despair slightly... I am probably very wrong, and
>>> NASA have excellent Slurm integration.
>>>
>>> So I agree with Chris - implement cgroups, and try to make sure your
>>> ssh session 'lands' in a cgroup. 'lscgroup' is a nice command to see
>>> what cgroups are active on a compute node. Also, run an interactive
>>> job, ssh into one of your allocated worker nodes, and then
>>> 'cat /proc/self/cgroup' shows which cgroups you have landed in.
>>>
>>> On 5 March 2018 at 02:20, Christopher Samuel <ch...@csamuel.org> wrote:
>>>
>>>> On 05/03/18 12:12, Dan Jordan wrote:
>>>>
>>>>> What is the /correct/ way to clean up processes across the nodes
>>>>> given to my program by SLURM_JOB_NODELIST?
>>>>
>>>> I'd strongly suggest using cgroups in your Slurm config to ensure that
>>>> processes are corralled and tracked correctly.
>>>>
>>>> You can use pam_slurm_adopt from the contrib directory to capture
>>>> inbound SSH sessions into a running job on the node (and deny access
>>>> to people who don't have one).
>>>>
>>>> Then Slurm should take care of everything for you without needing an
>>>> epilog.
>>>>
>>>> Hope this helps!
>>>> Chris
>
> --
> Dan Jordan
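One last thought, in case it helps when you do talk to your administrators:
this is roughly what the cgroup plus pam_slurm_adopt setup looks like on
systems I have seen. It is only a sketch, and the exact parameters depend on
the Slurm version in use, so please treat it as a starting point for that
conversation rather than a drop-in configuration:

    # slurm.conf (admin side): track and contain job processes with cgroups
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup,task/affinity

    # cgroup.conf
    ConstrainCores=yes
    ConstrainRAMSpace=yes

    # /etc/pam.d/sshd: adopt inbound ssh sessions into the owning job,
    # so Trick's ssh-spawned workers are cleaned up with the job
    account    required     pam_slurm_adopt.so

With that in place, the ssh sessions Trick opens to the worker nodes should be
adopted into the job's cgroup and killed when the job ends or is cancelled.
Note, though, that killing the processes will not unlink the queue files under
/dev/mqueue - those persist until something removes them - so you will
probably still want the 'rm /dev/mqueue/*' in the master epilog either way.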