Hi all,

FYI, this is the ticket I opened up:
https://issues.apache.org/jira/browse/MAPREDUCE-6923
Thanks in advance!

Robert

On Mon, Jul 31, 2017 at 10:21 PM, Ravi Prakash <[email protected]> wrote:

> Hi Robert!
>
> I'm sorry I do not have a Windows box and probably don't understand the
> shuffle process well enough. Could you please create a JIRA in the
> MapReduce project if you would like this fixed upstream?
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=116&projectKey=MAPREDUCE
>
> Thanks
> Ravi
>
> On Mon, Jul 31, 2017 at 6:36 AM, Robert Schmidtke <[email protected]>
> wrote:
>
>> Hi all,
>>
>> I just ran into an issue, which likely resulted from my not very
>> intelligent configuration, but nonetheless I'd like to share this with the
>> community. This is all on Hadoop 2.7.3.
>>
>> In my setup, each reducer fetched roughly 65K from each mapper's spill
>> file. I disabled transferTo during the shuffle because I wanted to look at
>> the file system statistics, which miss mmap calls, and transferTo sometimes
>> falls back to mmap. I left the shuffle buffer size at 128K (not knowing
>> about the parameter at the time). As a result, I observed roughly 100% more
>> data being read during the shuffle than necessary, since a full 128K buffer
>> was read for each 65K actually needed.
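>>
>> As a quick back-of-the-envelope (numbers from my setup above, just to make
>> the 100% figure concrete):
>>
>>   // Approximate read amplification with a 128K buffer and 65K partitions,
>>   // assuming one full buffer read per partition fetch.
>>   long bytesRead = 128 * 1024;    // bytes read from the spill per fetch
>>   long bytesNeeded = 65 * 1024;   // bytes the reducer actually needs
>>   double extra = (double) (bytesRead - bytesNeeded) / bytesNeeded; // ~0.97, i.e. ~100% more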
>>
>> I added a quick fix to Hadoop that chooses the minimum of the partition
>> size and the shuffle buffer size (a short sketch of the idea is below):
>> https://github.com/apache/hadoop/compare/branch-2.7.3...robert-schmidtke:adaptive-shuffle-buffer
>> Benchmarking this version against transferTo.allowed=true yields the same
>> runtime and only about 10% excess reads in YARN during the shuffle phase
>> (down from roughly 100% before).
>> Maybe this is something that should be added to Hadoop? Or do users have
>> to be more clever about their job configurations? I'd be happy to open a PR
>> if this is deemed useful.
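>>
>> To make the buffer capping concrete, here is a minimal sketch of the idea
>> (illustrative names only, not the actual fields in the shuffle handler;
>> the real change is in the branch linked above):
>>
>>   // Never read more per transfer than the partition actually contains,
>>   // so fetching a 65K partition no longer triggers a full 128K read.
>>   static int chooseBufferSize(long shuffleBufferSize, long partitionLength) {
>>     return (int) Math.min(shuffleBufferSize, partitionLength);
>>   }
>>
>>   // e.g. chooseBufferSize(128 * 1024, 65 * 1024) == 65 * 1024, and the
>>   // ByteBuffer used for the copy is allocated with that smaller size.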
>>
>> Anyway, thanks for the attention!
>>
>> Cheers
>> Robert
>>
>> --
>> My GPG Key ID: 336E2680
>>
>
>


-- 
My GPG Key ID: 336E2680
