For load distribution, you can start by reading up on the different types of Hadoop schedulers. I have not studied implementations other than Hadoop, but a very simplified version of the distribution concept is the following:
a) The TaskTracker asks for work (the heartbeat carries the status of the worker node, including the number of free slots).
b) The JobTracker picks a job from a list that is sorted according to the configured policy (fair scheduling, FIFO, LIFO, other SLAs).
c) The TaskTracker executes the map/reduce task.

As mentioned before, there are many more details. In b) there is an implementation of delay scheduling, which improves throughput by taking the input data location of the picked job into account. There is also a preemption mechanism that regulates fairness between pools, etc. A good starting point is the book that Prashant mentioned. A minimal sketch of the heartbeat/assignment loop is included below the quoted message.

On 23 April 2012 23:49, Prashant Kommireddi <[email protected]> wrote:
> Shailesh, there's a lot that goes into distributing work across
> tasks/nodes. It's not just distributing work but also fault-tolerance,
> data locality etc that come into play. It might be good to refer
> Hadoop apache docs or Tom White's definitive guide.
>
> Sent from my iPhone
>
> On Apr 23, 2012, at 11:03 AM, Shailesh Samudrala <[email protected]>
> wrote:
>
> > Hello,
> >
> > I am trying to design my own MapReduce Implementation and I want to know
> > how hadoop is able to distribute its workload across multiple computers.
> > Can anyone shed more light on this? thanks!
>
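
Here is the minimal sketch of steps a) and b) I referred to above. It is not the real JobTracker/TaskTracker code; all class and method names are invented for illustration, and the "prefer a data-local task" check is only a crude stand-in for delay scheduling.

// Toy heartbeat/assignment loop, assuming invented names (Heartbeat,
// MapTask, FifoScheduler) -- not Hadoop's actual APIs.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

class Heartbeat {                     // (a) what a worker reports
    final String host;
    final int freeMapSlots;
    Heartbeat(String host, int freeMapSlots) {
        this.host = host;
        this.freeMapSlots = freeMapSlots;
    }
}

class MapTask {                       // one unit of work plus its preferred host
    final String jobId;
    final String inputHost;           // where the input split lives
    MapTask(String jobId, String inputHost) {
        this.jobId = jobId;
        this.inputHost = inputHost;
    }
}

class FifoScheduler {                 // (b) pick work according to a policy (FIFO here)
    private final Deque<MapTask> pending = new ArrayDeque<>();

    void submit(MapTask t) { pending.addLast(t); }

    // On each heartbeat, prefer a task whose input is local to the caller
    // (crude stand-in for delay scheduling); otherwise hand out the oldest task.
    Optional<MapTask> assign(Heartbeat hb) {
        if (hb.freeMapSlots == 0 || pending.isEmpty()) return Optional.empty();
        for (MapTask t : pending) {
            if (t.inputHost.equals(hb.host)) {
                pending.remove(t);
                return Optional.of(t);
            }
        }
        return Optional.of(pending.removeFirst());
    }
}

public class SchedulerSketch {
    public static void main(String[] args) {
        FifoScheduler scheduler = new FifoScheduler();
        scheduler.submit(new MapTask("job_1", "nodeA"));
        scheduler.submit(new MapTask("job_2", "nodeB"));

        // (a) a node heartbeats with a free slot, (b) the scheduler picks work,
        // (c) in real Hadoop the TaskTracker would now launch the map task.
        scheduler.assign(new Heartbeat("nodeB", 1))
                 .ifPresent(t -> System.out.println("nodeB runs " + t.jobId));
        scheduler.assign(new Heartbeat("nodeA", 1))
                 .ifPresent(t -> System.out.println("nodeA runs " + t.jobId));
    }
}

In the real system the policy step is pluggable (FIFO scheduler, Fair Scheduler, Capacity Scheduler), and fairness/preemption between pools sits on top of this basic loop.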
