----- "Kilian CAVALOTTI" <[EMAIL PROTECTED]> wrote: > Hi Chris,
Hello Kilian,

> On Tuesday 12 August 2008 08:29:31 pm Chris Samuel wrote:
> > We do use things like cpusets to try and limit the impact
> > that jobs can have on other jobs on the same nodes,
>
> I'm actually curious about how you implemented that.

Not a problem.

> Do you have NUMA hardware?

Yes, the cluster we're using this on has dual quad-core Barcelona
CPUs and 32GB of RAM per node (to get it to the 4GB/core level).
It's running CentOS 5 with the mainline kernel.

> Do you use a resources manager, and is the cpusets creation
> process integrated with it?

We are using Torque (an open source PBS derivative), which has
built-in cpuset support. It previously had some support for the
older SGI Altix cpusets, but that has now been replaced with
support for the 2.6 kernel implementation (which itself has since
been pulled into the more generic cgroups work).

The 2.6 cpuset support in Torque came out of a long discussion
between Garrick Staples and myself at SC'07, where we nutted out
the basic design and Garrick then did the hard work of
implementing it.

> How do you manage concurrent jobs running on the same
> machine: do you pin them on specific CPUs and keep track
> of what CPU is busy and which is not, or do you have a
> way to just limit the number of CPUs they're using?

There are two major assumptions in the current Torque code:

1) There is a direct mapping between Torque's concept of vnodes
   (cpus) and cores, i.e. if you have told Torque a node has 8
   cpus then it has 8 cores to bind to.

2) The cpus are contiguous and start at 0. So if you are using a
   boot cpuset it's best to reserve the *last* core in the box
   for that, not the first. You will also need to tell Torque
   that the node has N-1 cpus (there's a nodes file sketch at the
   end of this mail).

The design is sort of hierarchical:

1) A top level "torque" cpuset is created by the pbs_mom when it
   starts, if it does not already exist. It adds all the cpus and
   mems into it.

2) When a job is scheduled onto the node(s), the pbs_mom creates
   a job cpuset which includes the specific cpus (vnodes) that
   have been allocated by the scheduler, and all the mems present
   (it currently makes no attempt to be clever about that). A
   shell sketch of the equivalent steps is at the end of this
   mail.

3) Prior to the 2.3.2 release there was a per-vnode (core) cpuset
   created within the job cpuset, and processes launched via the
   PBS tm_spawn interface by tools like Pete Wyckoff's mpiexec
   would get locked to a core. Great in theory, but... MPI tools
   like Open MPI's mpiexec only make a single tm_spawn call *per
   node* and then fork the MPI processes from that, so with the
   old code you would end up with all the processes of an Open
   MPI job locked to a single core. That's been changed now to
   just put processes in the job cpuset.

This still leaves issues for codes that use rsh/ssh based MPI
launchers, but we're playing around with a drop-in script that
makes them do the right thing using pbsdsh instead (rough sketch
at the end of this mail).

> As you can guess, I'd be interested in some technical details. :)

Hope that's useful!

We also have an init script that does:

  mkdir /dev/cpuset
  mount -t cpuset none /dev/cpuset

to make sure the cpuset VFS is there on boot.

Tangent: Linux cpusets were how we found that the noacpi boot
option broke the kernel's detection of NUMA capabilities [1] on
Barcelona, as /dev/cpuset/mems only had "0" in it, not "0-1" as
it should have!

[1] It first tries a K8-specific hack and then uses ACPI, so for
    K10: no ACPI, no NUMA. ;-)
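On the nodes file point above: telling Torque a node has N-1 cpus
is just the usual np= setting in the server's nodes file. A
minimal sketch for one of our 8-core boxes with the last core
held back for a boot cpuset (the hostname is made up):

  # $TORQUE_HOME/server_priv/nodes
  node001 np=7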
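And for the curious, the job cpuset creation in step (2) boils
down to the raw 2.6 cpuset VFS operations below. This is a
hand-rolled sketch of the equivalent steps, not Torque's actual
code; the job ID, cpu list and $JOBPID are made up for
illustration:

  # create a cpuset for a job allocated cpus 0-3
  mkdir /dev/cpuset/torque/1234.headnode
  echo 0-3 > /dev/cpuset/torque/1234.headnode/cpus
  # all mems present, as per step (2) above
  echo 0-1 > /dev/cpuset/torque/1234.headnode/mems
  # move the job's top process into the cpuset
  echo $JOBPID > /dev/cpuset/torque/1234.headnode/tasks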
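The drop-in script mentioned above is essentially a fake rsh that
pushes the remote command through the TM interface, so it lands
in the job cpuset rather than escaping via rshd/sshd. A rough
sketch only, assuming pbsdsh's -h (run on the named host) and -o
(capture output) options:

  #!/bin/sh
  # drop-in "rsh" for use inside a Torque job: MPI launchers
  # call this as rsh <host> <command>, and we run the command
  # on that host via tm_spawn so it inherits the job cpuset.
  host="$1"; shift
  exec pbsdsh -o -h "$host" "$@"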
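The NUMA breakage in the tangent is easy to spot, by the way:

  # a healthy two-socket Barcelona box shows both nodes:
  cat /dev/cpuset/mems   # expect "0-1"; with noacpi we saw "0"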
cheers,
Chris

-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency