On Thu, 30 Jul 2015 at 11:34 -0000, Tom Harvill wrote:
> We run SLURM with cgroups for memory containment of jobs. When users
> request resources on our cluster, many times they will specify the
> number of (MPI) tasks and memory per task. The reality of much of the
> software that runs is that most of the memory is used by MPI rank 0
> and much less on the slave processes. This is wasteful and sometimes
> causes bad outcomes (OOMs and worse) during job runs.
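For concreteness, the request pattern Tom describes would look something like the sketch below; the task count, memory figure, and application name are made-up examples rather than anything from his site:

#!/bin/bash
# Hypothetical SLURM job script illustrating the per-task memory request.
#SBATCH --ntasks=16        # 16 MPI ranks
#SBATCH --mem-per-cpu=2G   # memory requested per allocated CPU/task
srun ./mpi_app
# If rank 0 alone needs far more than its 2G share while the other
# ranks use much less, the node hosting rank 0 can hit its cgroup
# limit and be OOM-killed, even though the aggregate request looked
# sufficient on paper.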
I'll note that this problem can also occur with Grid Engine and OpenMPI. We would get user reports of random job failures: sometimes a job would run, other times it would fail.

The failures I've observed were due to the MPI startup process, which spawns a qrsh/ssh login from the master node to each of the slave nodes (multiple MPI ranks on a slave share the same qrsh connection). The memory used by all of these qrsh processes on the master node can eventually add up to enough to cause out-of-memory conditions.

We normally run with shared node access, and the problem cases I've seen were on a highly fragmented cluster with tasks spread 1-2 per node. Having the job request exclusive nodes (8 cores each) was generally enough to consolidate the qrsh processes from ~200 to ~50, which provided enough headroom on the master process (a sketch of the kind of request is at the end of this message).

This "solution" (workaround) has been good enough for our impacted users so far. Eventually, without other changes, the problem will return and will not have as simple a solution.

Stuart

-- 
I've never been lost; I was once bewildered for three days, but never lost!
   -- Daniel Boone
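For reference, the exclusive-node request I mean looks roughly like the following under Grid Engine. The parallel environment name ("orte"), the 64-slot count, and the boolean "exclusive" complex are assumptions about site configuration (the exclusive-host resource has to be defined and enabled by the admin), not a universal recipe:

# Hypothetical Grid Engine submission for the workaround above.
# Assumes a PE ("orte") that packs 8 slots per node and a site-defined
# boolean complex named "exclusive" that grants whole-node access.
# Packing ranks 8-per-node means one qrsh from the master node per
# slave node instead of one per 1-2 ranks, so far fewer qrsh processes
# pile up on the node running rank 0.
qsub -cwd -N mpi_job -pe orte 64 -l exclusive=true ./run_mpi_job.sh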