Hi all, fyi this is the ticket I opened up: https://issues.apache.org/jira/browse/MAPREDUCE-6923 Thanks in advance!
Robert

On Mon, Jul 31, 2017 at 10:21 PM, Ravi Prakash <[email protected]> wrote:
> Hi Robert!
>
> I'm sorry I do not have a Windows box and probably don't understand the
> shuffle process well enough. Could you please create a JIRA in the
> mapreduce project if you would like this fixed upstream?
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=116&projectKey=MAPREDUCE
>
> Thanks
> Ravi
>
> On Mon, Jul 31, 2017 at 6:36 AM, Robert Schmidtke <[email protected]> wrote:
>
>> Hi all,
>>
>> I just ran into an issue, which likely resulted from my not very
>> intelligent configuration, but nonetheless I'd like to share it with the
>> community. This is all on Hadoop 2.7.3.
>>
>> In my setup, each reducer fetched roughly 65K from each mapper's spill
>> file. I disabled transferTo during shuffle because I wanted to look at the
>> file system statistics, which miss the mmap calls that transferTo sometimes
>> falls back to. I left the shuffle buffer size at 128K (not knowing about
>> the parameter at the time). The effect was that I observed roughly 100%
>> more data being read during shuffle, since 128K were read for every 65K
>> actually needed.
>>
>> I added a quick fix to Hadoop that chooses the minimum of the partition
>> size and the shuffle buffer size:
>> https://github.com/apache/hadoop/compare/branch-2.7.3...robert-schmidtke:adaptive-shuffle-buffer
>> Benchmarking this version against transferTo.allowed=true yields the same
>> runtime and roughly 10% more reads in YARN during the shuffle phase
>> (compared to the previous 100%).
>> Maybe this is something that should be added to Hadoop? Or do users have
>> to be more clever about their job configurations? I'd be happy to open a PR
>> if this is deemed useful.
>>
>> Anyway, thanks for the attention!
>>
>> Cheers
>> Robert
>>
>> --
>> My GPG Key ID: 336E2680
>
> --
> My GPG Key ID: 336E2680
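For anyone following along, the core of the adaptive-buffer idea in the linked branch comparison can be sketched as below. The class and method names here are illustrative only, not the actual Hadoop 2.7.3 shuffle internals; the point is simply capping the read buffer at the partition size so a ~65K partition is not served with a full 128K buffer:

```java
// Hypothetical sketch of the "adaptive shuffle buffer" fix described in the
// thread. Names are made up for illustration; the real change lives in the
// shuffle handler's file-serving path on the linked branch.
public class AdaptiveShuffleBuffer {

    /**
     * Pick the read-buffer size for serving one map-output partition.
     * Instead of always allocating the configured shuffle buffer size
     * (128K by default in the scenario above), cap it at the partition
     * size, so small partitions don't cause over-reading.
     */
    static int chooseBufferSize(long partitionSize, int configuredBufferSize) {
        // partitionSize may exceed int range, so compare as long and cast
        // only after capping at configuredBufferSize (an int).
        return (int) Math.min(partitionSize, (long) configuredBufferSize);
    }

    public static void main(String[] args) {
        // ~65K partition with a 128K configured buffer: read with 65K.
        System.out.println(chooseBufferSize(65 * 1024, 128 * 1024));
        // Large partition: keep the full configured buffer.
        System.out.println(chooseBufferSize(1_000_000L, 128 * 1024));
    }
}
```

With buffered (non-transferTo) reads, reading a 65K partition through a fixed 128K buffer can pull in nearly twice the needed bytes, which matches the ~100% read amplification reported above; capping the buffer removes most of that overhead.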
