Re: Map Reduce Phase questions:

Harsh J Fri, 16 Dec 2011 06:23:42 -0800

On Fri, Dec 16, 2011 at 7:03 PM, Ann Pal <[email protected]> wrote:
> Hi,
> I had some questions specifically on the Map-Reduce phase:
>
> [1] For the reduce phase, the TaskTrackers corresponding to the reduce node,
> poll the Job Tracker to know about maps that have completed and if the
> Jobtracker informs it about maps that are complete, it then pulls the data
> from the map node where the map is complete. This is a "pull" model as
> opposed to "push" model where the map directly sends a region of the map
> output to the appropriate reduce node. Is the pull model the default  in
> 0.20, 0.23 etc ?


Yes, it is.

> In the pull model, how does the Reduce node know it is responsible for a
> particular region of map output? (Is this determined up front? From where it
> gets this information?)

The reducer ID is == the partition ID from the map side. Thereby,
reducer 0 will pull 0th partitions, and so on.

> [2]There can be multiple reduce tasks per reduce node. The number of reduce
> tasks is configurable, How about the number of reduce nodes? How is this
> determined?

Reducers are assigned by the scheduler in use. They are assigned once
per TT heartbeat, to have somewhat of an equal distribution today -
but there is no way to induce locality of reducers from the
user-program itself. So far I've not seen a case where this may be
absolutely necessary to have, as the schedulers today are pretty
capable of doing things right.

> [3]Pre 0.23, The map/reduce tasks slots for a node are allocated statically
> . Is this based on just configuration ?

What do you mean by 'allocated statically' here? Are you talking about
slot limit configurations?

-- 
Harsh J

Re: Map Reduce Phase questions:

Reply via email to