That does sound interesting, but more for some of my personal projects. It wouldn't work for the situation at hand because:

1) It sounds like it introduces a single point of failure (the head node).
2) Giving our developers cluster-wide 'killall' and 'kill' functionality makes me cringe. Most of them know just enough about Linux to be dangerous.
3) It would require completely reworking our current cluster solution, a daunting task to say the least.
4) There isn't much love for commercial and non-OSS software at our company.

On 11/30/08, Donald Becker <[EMAIL PROTECTED]> wrote:
> On Wed, 26 Nov 2008, Thomas Vixel wrote:
>
>> I've been googling for a top-like cli tool to use on our cluster, but
>> the closest thing that comes up is Rocks' "cluster top" script. That
>> could be tweaked to work via the cli, but due to factors beyond my
>> control (management) all functionality has to come from a pre-fab
>> program rather than a software stack with local, custom modifications.
>>
>> I'm sure this has come up more than once in the HPC sector as well --
>> could anyone point me to any top-like apps for our cluster?
>
> Most remote job mechanisms only think about starting remote processes, not
> about the full create-monitor-control-report functionality.
>
> The Scyld system (currently branded "Clusterware") defaults to using a
> built-in unified process space. That presents all of the processes
> running over the cluster in a process space on the master machine, with
> fully POSIX semantics. It neatly solves your need with... the standard
> 'top' program.
>
> Most scheduling systems also have a way to monitor processes that they
> start, but I haven't found one that takes advantage of all information
> available and reports it quickly/efficiently.
>
> There are many advantages of the Scyld implementation
> -- no new or modified process management tools need to be written.
> Standard utilities such as 'top' and 'ps' work unmodified,
> as well as tools we didn't specifically plan for e.g. GUI versions of
> 'pstree'.
> -- The 'killall' program works over the cluster, efficiently.
> -- All signals work as expected, including 'kill -9'. (Most remote
> process starting mechanisms will just kill off the local endpoint,
> leaving the remote process running-but-confused.)
> -- Process groups and controlling-TTY groups works properly for job
> control and signals
> -- Running jobs report their status and statistics accurately -- an
> updated 'rusage' structure is sent once per second, and a final
> rusage structure and exit status is sent when the process terminates.
>
> The "downside" is that we explicitly use Linux features and details,
> relying on kernel-version-specific features. That's an issue if it's a
> one-off hack, but we've been using this approach continuously for
> a decade, since the Linux 2.2 kernel and over multiple
> architectures. We've been producing supported commercial releases
> since 2000, longer than anyone else in the business.
>
> --
> Donald Becker [EMAIL PROTECTED]
> Penguin Computing / Scyld Software
> www.penguincomputing.com www.scyld.com
> Annapolis MD and San Francisco CA
>
>
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf