Hi all,

This is my first post, so please accept my apologies if the questions are too simple. I would appreciate pointers to documentation, write-ups, how-tos, etc. I did some research before posting here, but got lost in the sheer amount of information and the number of competing technologies available.

We are planning to set up a cluster at my workplace to handle some computation-heavy jobs, and the main task at the moment is to choose the right technology.

First of all, let me try to describe the task we have at hand.

1. Users submit a lot of relatively short jobs. There are also much longer jobs submitted automatically on a known schedule.

2. Even though the jobs are short (they take minutes to complete on a single machine), it is still important to parallelize each job so that it runs even faster (on the order of tens of seconds). This is the financial industry we are talking about, and time is money.

3. The jobs are quite easily parallelizable, probably embarrassingly so. A simple master/slave pattern applies naturally here. We already have a parallel implementation that runs on a single host and uses multiple processors via threads. It would be nice to be able to do the same across many machines.

4. Jobs have to be scheduled properly, meaning that some users should have higher priority than others, and especially higher than the automated long-running jobs; if a user submits too many jobs, his priority decreases; etc.

5. The implementation has to be fault tolerant, transparently surviving individual machine failures. Transparently for users, that is; it is OK to program tasks in a special way to get fault tolerance.

6. It would be nice to be able to submit "backup" tasks once a job nears completion, just in case some nodes in the cluster are running slowly. E.g. if a job is split into 1000 tasks, runs on a 16-node cluster and is almost done, with just 4 tasks left to finish and a lot of idle nodes on the cluster, the scheduler could submit each outstanding task to two machines and pick up the result from whichever one completes first. If the cluster is heterogeneous, or one node just runs slower, this could speed up job completion considerably. At least that's what I've read in Google's MapReduce paper.

7. Some cluster health monitoring is needed. It does not have to be sophisticated, but at least we should be able to learn easily that some host has died and needs repair. Statistics are nice to have as well, so that we can adjust user priorities, make decisions on buying new hardware, etc.

8. The business is somewhat Windows-centric, though I would try to push Linux as a platform. That is doable, provided the benefits are good. A Linux port of the program is not a problem.
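
To make item 6 concrete, here is a rough single-machine sketch of the "backup task" idea in Python. It is only an illustration under stated assumptions, not how any particular scheduler implements it: a thread pool stands in for the cluster, and the names `run_task`, `run_with_backups`, `backup_threshold` and `backup_delay` are hypothetical. Once the number of unfinished tasks drops to the threshold, each one is submitted a second time and the first copy to finish supplies the result.

```python
import concurrent.futures as cf
import time


def run_task(task_id, delay):
    """Stand-in for one unit of work; `delay` simulates how slow the node is."""
    time.sleep(delay)
    return task_id * task_id


def run_with_backups(tasks, workers=4, backup_threshold=2, backup_delay=0.01):
    """Run all tasks; near the end, submit one duplicate of each straggler.

    `tasks` maps task_id -> simulated run time on the node that gets it first.
    Returns (results, ids of the tasks that received a backup copy).
    """
    results = {}
    backed_up = set()
    pool = cf.ThreadPoolExecutor(max_workers=workers)
    # Map each future to its task id; a task may own two futures later on.
    pending = {pool.submit(run_task, tid, d): tid for tid, d in tasks.items()}
    while len(results) < len(tasks):
        done, _ = cf.wait(pending, timeout=0.05, return_when=cf.FIRST_COMPLETED)
        for fut in done:
            tid = pending.pop(fut)
            # Keep whichever copy finished first; ignore the slower duplicate.
            results.setdefault(tid, fut.result())
        outstanding = {tid for tid in pending.values() if tid not in results}
        if len(outstanding) <= backup_threshold:
            for tid in outstanding - backed_up:
                backed_up.add(tid)
                # Assumption: the backup copy lands on a fast, idle node.
                pending[pool.submit(run_task, tid, backup_delay)] = tid
    # Drop copies that have not started yet; do not wait for the slow ones.
    for fut in pending:
        fut.cancel()
    pool.shutdown(wait=False)
    return results, backed_up


if __name__ == "__main__":
    # Five quick tasks and one straggler (task 5) on a "slow node".
    demo = {0: 0.01, 1: 0.01, 2: 0.01, 3: 0.01, 4: 0.01, 5: 0.5}
    print(run_with_backups(demo))
```

Threads cannot be killed, so the losing copy here simply runs to completion in the background; on a real cluster the scheduler would presumably cancel it on the remote node. Tasks also need to be safe to run twice for this to work.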

Potential solutions I see:

1. TIBCO distributed queue. In short, it is a proprietary solution that amounts more or less to fault-tolerant load balancing. The downsides are the absence of any scheduler (it works as a FIFO) and the fact that it is proprietary. We would much rather use open source technologies. See below for a bit of info on TIBCO.

2. MPI with some scheduler (Condor?). From what I have read, it looks like fault tolerance is not easy to achieve in the MPI world, and even where it is possible, a failure on the master node will render the whole cluster unusable. I could be wrong on this, and I hope I am.

3. Torque? Grid Engine? Globus? Something else?

What are your suggestions? We need to decide on a technology and try to implement it, gaining more knowledge in the process and hopefully making a more informed decision in version 2 of our cluster. Any input would be greatly appreciated.


Some details on TIBCO. TIBCO at heart is an enterprise messaging system, which propagates information via broadcasts on the same subnet and can route it from subnet to subnet via a special daemon. It was mainly designed to integrate various systems in an enterprise via a common pipe to which each system connects for data exchange, instead of building many point-to-point connections between individual systems. On top of this messaging technology they developed a distributed queue, which works like this: you start many copies of the app on many machines; they all find each other via broadcasts, elect a master among themselves, and send heartbeat messages every now and then to monitor the health of the nodes. Once a message arrives, the master chooses which worker should process it. If one of the nodes dies, the master resubmits its task to another node. If the master dies, the remaining nodes elect a new master and keep going from there. This is possible because all communication is done via broadcasts, so every node can maintain the master's state.
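
The behaviour described above can be modelled in a few lines of Python. This is a toy simulation, not the TIBCO API: there is no networking, the class `Group` and the constant `HEARTBEAT_TIMEOUT` are hypothetical, and the election rule (lowest live node id becomes master) and the least-loaded assignment rule are my own assumptions, since I do not know how the product actually decides either. The point is only that if every node keeps the same state from the broadcasts it hears, any survivor can take over.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical value: seconds of silence before a node is presumed dead.
HEARTBEAT_TIMEOUT = 3.0


@dataclass
class Group:
    """The view every member builds from the broadcasts it hears."""
    last_seen: Dict[int, float] = field(default_factory=dict)  # node -> time
    assigned: Dict[str, int] = field(default_factory=dict)     # task -> node

    def heartbeat(self, node: int, now: float) -> None:
        self.last_seen[node] = now

    def alive(self, now: float) -> List[int]:
        return sorted(n for n, t in self.last_seen.items()
                      if now - t <= HEARTBEAT_TIMEOUT)

    def master(self, now: float) -> Optional[int]:
        # Assumed election rule: the lowest live node id wins.
        live = self.alive(now)
        return live[0] if live else None

    def assign(self, task: str, now: float) -> int:
        """Master's decision: give the task to the least-loaded live worker."""
        live = self.alive(now)
        if not live:
            raise RuntimeError("no live nodes")
        # The master takes work itself only when it is the sole survivor.
        workers = [n for n in live if n != live[0]] or live

        def load(n: int) -> int:
            return sum(1 for owner in self.assigned.values() if owner == n)

        node = min(workers, key=lambda n: (load(n), n))
        self.assigned[task] = node
        return node

    def reassign_orphans(self, now: float) -> Dict[str, int]:
        """Resubmit tasks whose owner stopped sending heartbeats."""
        live = set(self.alive(now))
        moved = {}
        for task, owner in list(self.assigned.items()):
            if owner not in live:
                moved[task] = self.assign(task, now)
        return moved
```

For example, with nodes 1, 2 and 3 alive, node 1 is master and tasks go to 2 and 3; if node 2 falls silent, `reassign_orphans` moves its task to node 3; if node 1 later falls silent too, `master` returns 3.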


Regards,

Kirill Lapshin

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
