For what it's worth I'm a biased Grid Engine and Platform LSF user  ...

On Dec 29, 2006, at 11:40 AM, Nathan Moore wrote:

I've presently set up a cluster of 5 AMD dual-core linux boxes for my students (at a small college). I've got MPICH running, shared NIS/NFS home directories etc. After reading the MPICH installation guide and manual, I can't say I understand how to deploy MPICH for my students to use. So far as I can tell, there's no load balancing or migration of processes in the library, and so now I'm trying to figure out what piece of software to add to the cluster to (for example) prevent the starting of an MPI job when there's already another job running.

(1) Is openPBS or gridengine the appropriate tool to use for a multi-user system where mpich is available? Are there better scheduling options?


Both should be fine, although if you are considering *PBS you should look at both Torque (a fork of OpenPBS, I think) and PBSPro (commercial, but last time I checked they had very good options for academic sites). I can't speak intelligently about the PBS variants these days... it's been too long since I've been hands-on.

Lots of people use Grid Engine with MPICH using both loose and tight integration methods. The mailing list ([EMAIL PROTECTED]) has a very helpful community with an excellent signal to noise ratio.
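To give a flavor of what tight integration looks like: SGE uses a "parallel environment" (PE) object that allocates slots and runs start/stop hooks around the MPI job. A hypothetical PE definition is sketched below -- the startmpi.sh/stopmpi.sh helpers ship under $SGE_ROOT/mpi in the SGE distribution, but the exact paths and slot count here are made up for illustration, so check your install:

```
pe_name            mpich
slots              10
start_proc_args    $SGE_ROOT/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     $SGE_ROOT/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
```

You'd create it with "qconf -ap mpich", attach it to a queue, and then students submit with something like "qsub -pe mpich 4 myjob.sh" to get 4 slots.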

Despite being an SGE zealot there are times when I can make both a technical and business argument for why Platform LSF is the "best" solution for a particular project or problem -- you may want to add this to your evaluation plate if you are considering (at all) commercial options. If not, don't sweat it. For a small cluster in an academic environment LSF may be hard to justify, but if you can get good academic pricing it is often worthwhile to crunch the numbers -- LSF in some cases can 'win' from a features, lower-administrative-burden and support perspective, but this is a case-by-case thing.


(1.5) Can mortals install and configure Gridengine? Thus far it seems too wonderful for me to understand.

Grid Engine is easy to install. I've posted an article here that covers the stuff I wish someone had told me beforehand about SGE:

"Things to think about before installing Grid Engine"

http://gridengine.info/articles/2005/09/29/things-to-think-about-before-installing

... it boils down to the fact that during installation SGE is unusually sensitive to issues regarding hostnames and forward/reverse DNS resolution.
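As a quick sanity check before installing, you can verify on each node that a forward lookup of the hostname and a reverse lookup of the resulting address agree. A minimal sketch (standard library only; the short-name comparison is simplistic and just meant to flag obvious mismatches):

```python
import socket

def check_host(name):
    """Forward-resolve a hostname, then reverse-resolve the address back."""
    addr = socket.gethostbyname(name)    # forward lookup: name -> IP
    rev = socket.gethostbyaddr(addr)[0]  # reverse lookup: IP -> name
    return addr, rev

if __name__ == "__main__":
    host = socket.gethostname()
    addr, rev = check_host(host)
    print("%s -> %s -> %s" % (host, addr, rev))
    if rev.split(".")[0] != host.split(".")[0]:
        print("WARNING: forward/reverse lookups disagree; "
              "fix DNS or /etc/hosts before installing SGE")
```

Run it on every node; if any node prints the warning, sort out DNS or /etc/hosts first and save yourself an install headache.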



(2) Also, if my cluster is made up of a mix of single and dual processor machines, what's the proper way to tell mpd about that topology?

Depends on which MPI implementation and which of the many available methods you are using to bootstrap the process.
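With MPICH2's mpd bootstrap, for example, the hosts file can carry a per-machine CPU count -- the file below is a made-up illustration (check the MPICH documentation for your version, since the older MPICH1/p4 machinefile syntax differs):

```
# mpd.hosts -- one line per machine; ":N" gives the number of CPUs
node01:2
node02:2
node03:1
```

You'd then bring up the ring with something like "mpdboot -n 3 -f mpd.hosts" and let mpiexec place processes according to those counts.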


(3) It's likely that in the future I'll have part-time access to another cluster of dual-boot (XP/linux) machines. The machines will default to booting to Linux, but will occasionally (5-20 hours a week) be used as windows workstations by a console user (when a user is finished, they'll restart the machine and it will boot back to linux). If cluster nodes are available in this sort of unpredictable and intermittent way, can they be used as compute nodes in some fashion? Will gridengine/PBS/??? take care of this sort of process migration?


Grid Engine will not transparently preserve and migrate running jobs off of machines that get bounced suddenly. This sort of transparent and automatic checkpointing and migration is actually pretty hard to do in practice. If you know in advance which machines are going to be shut down and rebooted into windows, then there are tools in all the common scheduling packages for "draining" a particular machine or queue. You can also "kill and reschedule" jobs that are running on any one queue instance or cluster queue. One can even do this on a calendar basis when the "need windows" schedule is predictable (which does not seem possible in your case).

If the running cluster jobs are short-lived, so that you don't have a big runtime investment, then you can bounce machines whenever you want - Grid Engine can be told to reschedule failed jobs automatically to a different available host. The hard case to deal with is the very long-running job that (a) can't be reliably checkpointed or (b) is difficult to suspend/resume/migrate due to the nature of the parallel application itself.
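For the draining case, a hedged sketch using standard SGE admin commands (the queue and host names here are made up; see the qmod man page for the exact semantics on your version):

```
qmod -d all.q@node05      # disable the queue instance: no new jobs land there
qstat -f -q all.q@node05  # watch for the running jobs to finish draining
qmod -r all.q@node05      # or forcibly reschedule whatever is still running
qsub -r y myjob.sh        # submit jobs as rerunnable so rescheduling works
```

Note that rescheduling only helps if the jobs were submitted as rerunnable and can tolerate being restarted from scratch.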

The answer may be application specific in your case.

Regards,
Chris



best regards,

Nathan



- - - - - - - - - - - - - - - - - - - - - - -
Nathan Moore
Physics
Winona State University
[EMAIL PROTECTED]
AIM:nmoorewsu


_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

