On Dec 20, 2011, at 3:55 PM, Kevin Burton wrote:

> The current Hadoop implementation shuffles directly to disk, and those
> disk files are eventually requested by the target nodes which are
> responsible for doing the reduce() on the intermediate data.
>
> However, this requires 2x more IO than strictly necessary.
>
> If the data were instead shuffled DIRECTLY to the target host, this IO
> overhead would be removed.
We've discussed 'push' vs. 'pull' shuffle multiple times, and each time turned away due to complexities in MR1. With MRv2 (YARN) this would be much more doable.

IAC, a single reducer in typical (well-designed?) applications processes multiple gigabytes of data across thousands of maps. So, to really avoid any disk IO during the shuffle, you'd need very large amounts of RAM.

Also, currently, the shuffle is performed by the reduce task. This has two major benefits:

# The 'pull' can only be initiated after the reduce is scheduled; the 'push' model would be hampered if the reduce hasn't been started.
# The 'pull' is more resilient to the failure of a single reduce. In the push model, it's harder to deal with a reduce failing after a push from the map.

Again, with MRv2 we could experiment with push vs. pull where it makes sense (small jobs etc.). I'd love to mentor/help someone interested in putting cycles into it.

Arun
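The pull model described above can be sketched in a few lines. This is a toy illustration, not Hadoop code: the function names, the dict-as-disk spills, and the hash partitioning scheme are all assumptions made for the sketch. The point it shows is why pull is resilient: map outputs persist on "disk", so a reduce that fails can simply re-fetch, whereas in a push model the map-side state may be gone after the push.

```python
# Toy sketch of the 'pull' shuffle model (illustrative, not Hadoop code).
from collections import defaultdict

def map_phase(map_inputs, num_reduces):
    """Each map partitions its output by target reduce and spills to
    local 'disk' (a dict here). This persisted spill is the extra IO
    the thread discusses, but it is also what makes re-pulls cheap."""
    spills = []
    for records in map_inputs:
        partitions = defaultdict(list)
        for key, value in records:
            partitions[hash(key) % num_reduces].append((key, value))
        spills.append(partitions)  # map output stays on disk after the map exits
    return spills

def pull_shuffle(spills, reduce_id):
    """A reduce task, once scheduled, fetches its partition from every
    completed map output. If this reduce fails, a re-run just pulls
    again; no map needs to be re-executed."""
    fetched = []
    for partitions in spills:
        fetched.extend(partitions.get(reduce_id, []))
    return fetched

# Example: 2 maps, 2 reduces; every record lands in exactly one reduce.
map_inputs = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
spills = map_phase(map_inputs, num_reduces=2)
all_pulled = pull_shuffle(spills, 0) + pull_shuffle(spills, 1)
assert sorted(all_pulled) == sorted([("a", 1), ("b", 2), ("a", 3), ("c", 4)])
```

A push model would instead have each map stream `partitions[r]` to reduce `r` as it is produced, which saves the spill IO but requires every reduce to be up before the maps finish, and loses the cheap re-fetch on reduce failure.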
