On Monday 08 October 2007, Olli-Pekka Lehto wrote:
> Hello,
>
> I'm interested in hearing some best-practice solutions for fine-grained
> management of memory resources on clusters of SMPs. How do you enforce real
> memory usage inside (RHEL-based) cluster nodes running multiple serial
> jobs simultaneously? More specifically, how do you do this efficiently when
> some of the jobs map copious amounts of virtual memory but have only a
> fraction of it resident at any given time?
>
> As SMP systems keep getting constantly fatter (and the potential for
> users interfering with each other's jobs keeps increasing), it would be great
> to have something like AIX's WLM (Workload Manager) on Linux to
> effectively manage intra-SMP resources.
>
> Olli-Pekka
Ah, a family of issues near to my heart. :) I'll ask a broader question: how do you enforce real memory usage in modern Linux *at all*?

We became interested in this because user jobs were regularly driving nodes into an Out Of Memory (OOM) state, triggering the kernel's oom_killer. The oom_killer would sometimes kill system processes, which sometimes caused subsequent jobs to die. Even when subsequent jobs survived, recovery required that we manually close the node, reboot it once the running jobs finished, then reopen it. This gets to be pretty dreary after a while. Our problem is somewhat different from your interests, but some of the same issues come into play. See below for the partially satisfying solution that we put in place for our OOM woes.

First, a review of the problem landscape as I understand it.

You can try to enforce memory limits with a daemon, but you risk missing important events, including a badly behaved process suddenly using a whole lot of memory all at once. If that happens, your daemon is nearly useless, since swapping and/or the oom_killer will be running instead of your daemon. Your node may lock up for a while, which is exactly what the daemon was supposed to prevent.

I think you really want to do it in the kernel, so that badly behaved requests for memory (allocation and/or writing) can be cut off before they affect anyone else. But the kernel doesn't really enforce anything useful. It doesn't enforce a resident set size (RSS) limit, even though setrlimit() will let you request one (RLIMIT_RSS). As I understand it, modern Linux doesn't even try to track per-process RSS limits, because the semantics of RSS are unclear given modern memory management methods. RSS probably isn't even what you want anyway -- you probably want to limit the amount of physical memory used, keeping the sum of the limits around the total RAM, to avoid swapping. There is no way to communicate such a limit to the kernel; I suspect it doesn't even track it except globally.

The kernel *is* able to enforce the amount of virtual memory allocated per process (set with setrlimit(RLIMIT_AS)), but as you noted, that is of limited value when different applications can have very different overcommit percentages (virtual memory allocated beyond the amount actually used).

But take a step back from the limits you can place on a given process. You probably want a policy that limits memory use at the job level, not at the process level, regardless of whether one job or multiple jobs are running on a node. There is no kernel mechanism for that either. It seems your best bet might be to write a daemon, and hope that actual use patterns don't cause swapping or OOM before the daemon can act.

To end our OOM problems, we took a different route. The job launch mechanism (via LSF) sets the per-process virtual-memory-allocation limit on each user job process. We can prevent OOM this way, unless a job both uses non-standard job launch methods and has runaway memory use (which is rare in our experience).
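For concreteness, here is a minimal sketch of the idea (hypothetical names, not our actual LSF integration): a toy launcher that caps the address-space limit with setrlimit(RLIMIT_AS) and then execs the user's binary. A real batch system does the equivalent inside its job starter.

    /* vmcap.c -- hypothetical sketch of a launcher that caps virtual
     * memory before exec'ing the user's program.
     * Usage: vmcap <limit-in-MB> <program> [args...]
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/resource.h>

    int main(int argc, char **argv)
    {
        if (argc < 3) {
            fprintf(stderr, "usage: %s <limit-MB> <program> [args...]\n", argv[0]);
            return 1;
        }

        struct rlimit rl;
        rl.rlim_cur = rl.rlim_max = (rlim_t)atol(argv[1]) * 1024 * 1024;

        /* RLIMIT_AS caps total virtual memory allocated (brk, mmap, stack);
         * it says nothing about resident (physical) memory. */
        if (setrlimit(RLIMIT_AS, &rl) != 0) {
            perror("setrlimit(RLIMIT_AS)");
            return 1;
        }

        execvp(argv[2], &argv[2]);   /* the limit is inherited across exec */
        perror("execvp");
        return 1;
    }

Once the limit is hit, further allocations (brk/mmap) fail with ENOMEM rather than pushing the node into swap or OOM.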
Other weaknesses of our method include:

* It does not prevent heavy swapping (prevention would be nice, but at least the offending user suffers most of the consequences).

* It can prevent a job from using all available RAM if the job has a larger overcommit than our algorithm assumes.

* When the VM allocation limit is reached, the errors are often cryptic. Nothing appears in syslog (unlike segfaults, which are logged at least on x86_64); the kernel patch to enable such logging would likely be trivial, but stock kernels don't do it. A failing malloc() will return NULL with errno set to ENOMEM, which many programs and libraries don't handle properly (or indeed handle at all -- how many programmers omit checking the return value or errno?), so the user doesn't get a useful error message. A failed stack expansion will cause a segfault (as I recall), which is also cryptic to the user. At least segfaults get logged...

I'd love to hear other approaches to this family of problems.

David

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf