On Fri, Dec 16, 2011 at 7:03 PM, Ann Pal <[email protected]> wrote: > Hi, > I had some questions specifically on the Map-Reduce phase: > > [1] For the reduce phase, the TaskTrackers corresponding to the reduce node, > poll the Job Tracker to know about maps that have completed and if the > Jobtracker informs it about maps that are complete, it then pulls the data > from the map node where the map is complete. This is a "pull" model as > opposed to "push" model where the map directly sends a region of the map > output to the appropriate reduce node. Is the pull model the default in > 0.20, 0.23 etc ?
Yes, it is. > In the pull model, how does the Reduce node know it is responsible for a > particular region of map output? (Is this determined up front? From where it > gets this information?) The reducer ID is == the partition ID from the map side. Thereby, reducer 0 will pull 0th partitions, and so on. > [2]There can be multiple reduce tasks per reduce node. The number of reduce > tasks is configurable, How about the number of reduce nodes? How is this > determined? Reducers are assigned by the scheduler in use. They are assigned once per TT heartbeat, to have somewhat of an equal distribution today - but there is no way to induce locality of reducers from the user-program itself. So far I've not seen a case where this may be absolutely necessary to have, as the schedulers today are pretty capable of doing things right. > [3]Pre 0.23, The map/reduce tasks slots for a node are allocated statically > . Is this based on just configuration ? What do you mean by 'allocated statically' here? Are you talking about slot limit configurations? -- Harsh J
