Thanks again, Liyin.

On Sat, Aug 4, 2012 at 6:59 AM, 梁李印 <[email protected]> wrote:
> The optimization you mentioned is reduce-task locality awareness.
> Unfortunately, the current scheduler doesn't consider a reduce task's
> data locality, so a reduce task can be scheduled on any node with free
> slots.
> The following JIRA discusses this problem:
> https://issues.apache.org/jira/browse/MAPREDUCE-2038
>
> Liyin Liang
>
> -----Original Message-----
> From: Satheesh Kumar [mailto:[email protected]]
> Sent: August 4, 2012 1:47
> To: [email protected]
> Subject: Re: RE: MapReduce shuffle question
>
> Thank you. One more follow-up question:
>
> Are there any optimizations to run maps and reduces on the same nodes so
> that data is not transported across the network? Generally, how often and
> what % of map output is actually transferred over the network to reduce
> nodes?
>
> Thanks,
> Satheesh
>
> On Fri, Aug 3, 2012 at 7:33 AM, 梁李印 <[email protected]> wrote:
>
> > When a map task is done, its output is always flushed to disk and
> > merged into one file.
> > The benefit is that if the reducer fails, the map does not need to
> > re-run.
> >
> > Liyin Liang
> >
> > -----Original Message-----
> > From: Satheesh Kumar [mailto:[email protected]]
> > Sent: August 3, 2012 21:23
> > To: [email protected]
> > Subject: MapReduce shuffle question
> >
> > Team, can someone please clarify the following question?
> >
> > In the map phase, the map output is written to the local disk. And in the
> > shuffle phase, the map output partitions are transferred to reduce nodes
> > using HTTP. So, my question is: assuming there are no spills (the data set
> > is small enough to fit in memory), will the map output be transferred
> > directly from memory to the reduce nodes using HTTP, without a disk access
> > to write the map output? Or is the map output always flushed to disk
> > before being transferred to reduce nodes?
> >
> > Appreciate the help.
> >
> > Thanks,
> > Satheesh
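
A rough back-of-the-envelope sketch for the "what % of map output goes over the network" question quoted above. This is not from the thread and not a measurement. It assumes map output is spread evenly across the nodes, partitions are evenly sized, and reduce tasks are placed without regard to locality (as described in the reply). The function name `remote_shuffle_fraction` is made up for illustration and is not a Hadoop API.

```python
from fractions import Fraction


def remote_shuffle_fraction(num_nodes: int) -> Fraction:
    """Expected share of map output that a reducer fetches from other nodes.

    Toy model (assumption, not Hadoop behavior guaranteed by the thread):
    each node holds 1/num_nodes of the map output, and every reducer pulls
    its partition from every map. Only the share produced on the reducer's
    own node can be read locally; the rest crosses the network.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    local_share = Fraction(1, num_nodes)
    return 1 - local_share


if __name__ == "__main__":
    for n in (1, 2, 10, 100):
        print(n, float(remote_shuffle_fraction(n)))
```

Under these assumptions the remote share is (N-1)/N for N nodes, so on anything but a very small cluster nearly all map output is fetched over the network. Skewed keys or uneven map placement would change the number.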
