On Dec 20, 2011, at 3:55 PM, Kevin Burton wrote:

> The current Hadoop implementation shuffles directly to disk, and those
> disk files are eventually requested by the target nodes which are
> responsible for doing the reduce() on the intermediate data.
>
> However, this requires 2x more IO than strictly necessary.
>
> If the data were instead shuffled DIRECTLY to the target host, this IO
> overhead would be removed.
We've discussed 'push' vs. 'pull' shuffle multiple times, and each time turned away due to complexities in MR1. With MRv2 (YARN) this would be much more doable.

IAC, a single reducer in typical (well-designed?) applications processes multiple gigabytes of data across thousands of maps. So, to really avoid any disk IO during the shuffle, you'd need very large amounts of RAM.

Also, currently, the shuffle is performed by the reduce task. This has two major benefits:

# The 'pull' can only be initiated after the reduce is scheduled; the 'push' model would be hampered if the reduce hasn't been started.
# The 'pull' is more resilient to the failure of a single reduce. In the push model, it's harder to deal with a reduce failing after a push from the map.

Again, with MRv2 we could experiment with push vs. pull where it makes sense (small jobs etc.). I'd love to mentor/help someone interested in putting cycles into it.

Arun
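The pull model described above can be sketched in a few lines. This is a toy illustration, not Hadoop code: the function names, the dict-as-disk spills, and the hash partitioning scheme are all assumptions made for the sketch. The point it shows is why pull is resilient: map outputs persist on "disk", so a reduce that fails can simply re-fetch, whereas in a push model the map-side state may be gone after the push.

```python
# Toy sketch of the 'pull' shuffle model (illustrative, not Hadoop code).
from collections import defaultdict

def map_phase(map_inputs, num_reduces):
    """Each map partitions its output by target reduce and spills to
    local 'disk' (a dict here). This persisted spill is the extra IO
    the thread discusses, but it is also what makes re-pulls cheap."""
    spills = []
    for records in map_inputs:
        partitions = defaultdict(list)
        for key, value in records:
            partitions[hash(key) % num_reduces].append((key, value))
        spills.append(partitions)  # map output stays on disk after the map exits
    return spills

def pull_shuffle(spills, reduce_id):
    """A reduce task, once scheduled, fetches its partition from every
    completed map output. If this reduce fails, a re-run just pulls
    again; no map needs to be re-executed."""
    fetched = []
    for partitions in spills:
        fetched.extend(partitions.get(reduce_id, []))
    return fetched

# Example: 2 maps, 2 reduces; every record lands in exactly one reduce.
map_inputs = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
spills = map_phase(map_inputs, num_reduces=2)
all_pulled = pull_shuffle(spills, 0) + pull_shuffle(spills, 1)
assert sorted(all_pulled) == sorted([("a", 1), ("b", 2), ("a", 3), ("c", 4)])
```

A push model would instead have each map stream `partitions[r]` to reduce `r` as it is produced, which saves the spill IO but requires every reduce to be up before the maps finish, and loses the cheap re-fetch on reduce failure.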
