[Beowulf] Newbie questions on cluster technology to use

Kirill Lapshin Wed, 01 Aug 2007 13:02:32 -0700

Hi all,

This is my first post, so please accept my apologies if the questions are too simple. I would appreciate pointers to documentation, write-ups, howtos, etc. I did some research before posting here, but got lost in the sheer amount of information and competing technologies available.

We are planning to set up a cluster at my workplace to handle some computation-heavy jobs we have, and the main task at the moment is to choose the right technology.


First of all, let me try to describe the task we have at hand.

1. There are a lot of relatively short jobs submitted by users. There are also much longer jobs submitted automatically on a known schedule.

2. Even though jobs are short (they take minutes to complete on a single machine), it is still important to parallelize each job to run it even faster (on the order of tens of seconds). We are talking about the financial industry, and time is money.

3. Jobs are quite easily parallelizable, probably embarrassingly so. A simple master/slave pattern naturally applies here. We already have a parallel implementation running on a single host, using multiple processors via threads. It would be nice to be able to do the same across many machines as well.

4. Jobs have to be scheduled properly, meaning that some users should have higher priority than others, and especially higher than the automated long-running jobs; if a user submits too many jobs, his priority decreases, etc.

5. The implementation has to be fault tolerant, transparently surviving individual machine failures. Transparent for users, that is; it is OK to program tasks in a special way to get fault tolerance.

6. It would be nice to be able to submit "backup" tasks once a job nears completion, in case some nodes in the cluster are running slow. E.g. if a job is split into 1000 tasks and runs on a 16-node cluster, and it is almost done, with just 4 tasks left to finish and a lot of idle nodes on the cluster, the scheduler could submit each outstanding task to two machines and pick up the result from whichever one completes first. If the cluster is heterogeneous, or one node just runs slower, this could speed up job completion considerably. At least that's what I've read in Google's MapReduce paper.

7. Some cluster health monitoring is needed. It does not have to be sophisticated, but at least we should be able to learn easily that some host has died and needs repair. Statistics are nice to have as well, to be able to adjust user priorities, make decisions on buying new hardware, etc.

8. The business is somewhat Windows-centric, though I would try to push Linux as a platform. It is doable, provided the benefits are good. Porting the program to Linux is not a problem.
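A minimal sketch of the scheduling policy in item 4, assuming a simple fair-share rule: each user has a static base priority (lower runs sooner), and every dispatched job adds to that user's recent usage, which decays over time. The class name, the default base of 10.0 and the decay factor are hypothetical choices for illustration, not taken from Condor, Torque, Grid Engine or any other scheduler.

```python
from collections import deque


class FairShareScheduler:
    """Hypothetical fair-share dispatcher: lower effective priority runs sooner.

    effective priority = static base priority + recent usage,
    where usage grows by 1 per dispatched job and decays periodically.
    """

    def __init__(self, base_priority, default_base=10.0, decay=0.5):
        self.base = dict(base_priority)   # user -> static base (assumed values)
        self.default_base = default_base  # base for users not listed
        self.decay = decay                # multiplier applied on each decay tick
        self.usage = {}                   # user -> recent usage
        self.queues = {}                  # user -> FIFO of pending jobs

    def submit(self, user, job):
        self.queues.setdefault(user, deque()).append(job)

    def _effective(self, user):
        return self.base.get(user, self.default_base) + self.usage.get(user, 0.0)

    def dispatch(self):
        """Return (user, job) for the next job to run, or None if idle."""
        candidates = [u for u, q in self.queues.items() if q]
        if not candidates:
            return None
        # Ties go to the user who submitted first (dict insertion order).
        user = min(candidates, key=self._effective)
        self.usage[user] = self.usage.get(user, 0.0) + 1.0
        return user, self.queues[user].popleft()

    def decay_usage(self):
        """Call periodically so heavy users gradually regain priority."""
        for user in self.usage:
            self.usage[user] *= self.decay


# Two interactive users and an automated batch stream with a worse base.
sched = FairShareScheduler({"alice": 0.0, "bob": 0.0, "nightly_batch": 5.0})
for i in range(3):
    sched.submit("alice", f"a{i}")
sched.submit("bob", "b0")
sched.submit("nightly_batch", "n0")
order = [sched.dispatch() for _ in range(5)]
# order: alice a0, bob b0, alice a1, alice a2, nightly_batch n0
```

Alice's burst drops her behind Bob after her first job, and the batch stream only runs once the interactive work is drained. In practice one would probably also age waiting jobs so that batch work is not starved indefinitely.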


Potential solutions I see:

1. TIBCO distributed queue. In short, it is a proprietary solution that more or less provides fault-tolerant load balancing. The downsides are the absence of any scheduler (it works as FIFO) and the fact that it is proprietary. We would much rather use open source technologies. See below for a bit of info on TIBCO.

2. MPI with some scheduler (Condor?). From what I have read, fault tolerance is not easy to achieve in the MPI world, and even where it is possible, a failure of the master node will render the whole cluster unusable. I could be wrong on this, and I hope I am.


3. Torque? Grid Engine? Globus? Something else?

What are your suggestions? We need to decide on a technology and try to implement it, gaining more knowledge in the process and hopefully making a more informed decision in version 2 of our cluster. Any input would be greatly appreciated.

Some details on TIBCO. TIBCO at heart is an enterprise messaging system, which propagates information via broadcasts on the same subnet and can route it from subnet to subnet via a special daemon. It was mainly designed to integrate various systems in an enterprise via a common pipe, where each system connects for data exchange, instead of building many point-to-point connections between individual systems. On top of this messaging technology they developed a distributed queue, which works like this: you start many copies of the app on many machines; they all find each other via broadcasts, elect a master among themselves, and send heartbeat messages every now and then to monitor the health of the nodes. Once a message arrives, the master chooses which worker should process it. If one of the nodes dies, the master resubmits its task to another node. If the master dies, the remaining nodes elect a new master and keep going from there. This is possible because all communication is done via broadcasts, so every node can maintain the master's state.
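The master's side of the distributed queue described above can be sketched as plain bookkeeping: record each worker's last heartbeat, hand pending tasks to the least-loaded live worker, and requeue the tasks of any worker that has been silent longer than a timeout. This is a hypothetical Python model, not TIBCO's API; the `Master` class, the 3-second timeout and the injectable clock are assumptions, and broadcast messaging and master election are left out.

```python
import time
from collections import deque


class Master:
    """Hypothetical master bookkeeping for a heartbeat-based task queue."""

    def __init__(self, timeout=3.0, clock=time.monotonic):
        self.timeout = timeout   # seconds of silence before a worker is presumed dead
        self.clock = clock       # injectable so the example is deterministic
        self.last_seen = {}      # worker -> time of its last heartbeat
        self.assigned = {}       # worker -> set of task ids it is running
        self.pending = deque()   # tasks not yet assigned to any worker

    def heartbeat(self, worker):
        self.last_seen[worker] = self.clock()
        self.assigned.setdefault(worker, set())

    def submit(self, task):
        self.pending.append(task)

    def complete(self, worker, task):
        self.assigned.get(worker, set()).discard(task)

    def check_workers(self):
        """Drop silent workers and put their tasks back on the queue."""
        now = self.clock()
        dead = [w for w, t in self.last_seen.items() if now - t > self.timeout]
        for w in dead:
            del self.last_seen[w]
            self.pending.extend(sorted(self.assigned.pop(w)))
        return dead

    def assign(self):
        """Hand pending tasks to live workers, least loaded first."""
        self.check_workers()
        while self.pending and self.last_seen:
            worker = min(self.last_seen, key=lambda w: len(self.assigned[w]))
            self.assigned[worker].add(self.pending.popleft())


now = [0.0]                                  # fake clock for the example
m = Master(timeout=3.0, clock=lambda: now[0])
m.heartbeat("node1")
m.heartbeat("node2")
for task in ["t1", "t2", "t3", "t4"]:
    m.submit(task)
m.assign()                                   # node1: t1, t3; node2: t2, t4

now[0] = 5.0
m.heartbeat("node2")                         # node1 has gone silent
dead = m.check_workers()                     # ["node1"]; t1, t3 requeued
m.assign()                                   # t1, t3 go to node2
```

Because every node sees every message in the broadcast design, any surviving node could rebuild this same state, which is what lets a newly elected master carry on.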



Regards,

Kirill Lapshin

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
