[
https://issues.apache.org/jira/browse/MAPREDUCE-7349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
gaoyu updated MAPREDUCE-7349:
-----------------------------
Description:
Related cluster configuration:
* MAX_FETCH_FAILURES_NOTIFICATIONS is 3
* NodeManager recovery is disabled
Bug scenario:
# submit a wordcount job that contains 2 simple map tasks ({{map_0}} and
{{map_1}}) and 1 simple reduce task ({{reduce_0}});
# all map tasks finish successfully and the AppMaster is notified;
# the NodeManager that ran the map task {{map_1}} crashes;
# the AppMaster schedules a reduce attempt;
# the reduce attempt sends a {{statusUpdate}} message to the AppMaster to
report a fetch failure;
# the reduce attempt fails with {{Shuffle$ShuffleError}}, caused by
{{java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out}};
# the reduce attempt sends a {{fatalError}} message to the AppMaster;
# the AppMaster successively reschedules another three reduce attempts, but
all of them fail with {{Shuffle$ShuffleError}};
# the AppMaster fails the wordcount job because of the failed reduce task;
# the AppMaster receives three {{statusUpdate}} messages that report a fetch
failure, like the message in step 5, but it has already failed the job and
does not rerun the task {{map_1}}.
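
The ordering problem in the scenario above can be replayed with a small, self-contained Java model. This is only a sketch under stated assumptions, not Hadoop's actual AppMaster code: the class {{FetchFailureRaceSketch}}, its fields and its event names are hypothetical, the limit of four reduce attempts is inferred from the scenario (one attempt plus three rescheduled ones), and the real AppMaster may apply further conditions before rerunning a map task. The model only shows that the same messages lead to a failed job or to a map rerun depending on the order in which they arrive.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the AppMaster's bookkeeping; NOT Hadoop's actual code.
class FetchFailureRaceSketch {
    // Value taken from the cluster configuration in the report.
    static final int MAX_FETCH_FAILURES_NOTIFICATIONS = 3;
    // Assumption: one reduce attempt plus three rescheduled ones, as in the scenario.
    static final int MAX_REDUCE_ATTEMPTS = 4;

    // Hypothetical event names standing in for the two messages in the report.
    enum Event { STATUS_UPDATE_FETCH_FAILURE, FATAL_ERROR }

    int fetchFailureReports = 0;
    int failedReduceAttempts = 0;
    boolean jobFailed = false;
    boolean mapRerun = false;

    void handle(Event event) {
        if (jobFailed) {
            return; // messages arriving after the job has failed are ignored
        }
        if (event == Event.STATUS_UPDATE_FETCH_FAILURE) {
            fetchFailureReports++;
            if (fetchFailureReports >= MAX_FETCH_FAILURES_NOTIFICATIONS) {
                mapRerun = true; // enough reports: the map output is treated as lost
            }
        } else {
            failedReduceAttempts++;
            if (failedReduceAttempts >= MAX_REDUCE_ATTEMPTS) {
                jobFailed = true; // the reduce task ran out of attempts
            }
        }
    }

    static FetchFailureRaceSketch replay(List<Event> events) {
        FetchFailureRaceSketch appMaster = new FetchFailureRaceSketch();
        for (Event event : events) {
            appMaster.handle(event);
        }
        return appMaster;
    }

    // Order from the report: three statusUpdate messages arrive after the job failed.
    static List<Event> delayedOrder() {
        List<Event> events = new ArrayList<>();
        events.add(Event.STATUS_UPDATE_FETCH_FAILURE); // first reduce attempt
        for (int i = 0; i < MAX_REDUCE_ATTEMPTS; i++) {
            events.add(Event.FATAL_ERROR);
        }
        for (int i = 0; i < 3; i++) {
            events.add(Event.STATUS_UPDATE_FETCH_FAILURE); // delayed reports
        }
        return events;
    }

    // Hypothetical order in which every attempt's report precedes its fatalError.
    static List<Event> timelyOrder() {
        List<Event> events = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            events.add(Event.STATUS_UPDATE_FETCH_FAILURE);
            events.add(Event.FATAL_ERROR);
        }
        return events;
    }

    public static void main(String[] args) {
        FetchFailureRaceSketch delayed = replay(delayedOrder());
        FetchFailureRaceSketch timely = replay(timelyOrder());
        System.out.println("delayed: jobFailed=" + delayed.jobFailed
                + " mapRerun=" + delayed.mapRerun);
        System.out.println("timely:  jobFailed=" + timely.jobFailed
                + " mapRerun=" + timely.mapRerun);
    }
}
```

In this model the delayed order ends with the job failed and only one fetch-failure report counted, while the timely order reaches the notification threshold before the reduce task runs out of attempts.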
> An unexpected node crash and delayed messages would fail the job
> ----------------------------------------------------------------
>
> Key: MAPREDUCE-7349
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-7349
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: applicationmaster
> Affects Versions: 3.2.2
> Reporter: gaoyu
> Priority: Major
>
>
>