[jira] [Updated] (MAPREDUCE-7349) An unexpected node crash and delayed messages would fail the job

gaoyu (Jira) Thu, 03 Jun 2021 01:29:23 -0700


     [ https://issues.apache.org/jira/browse/MAPREDUCE-7349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


gaoyu updated MAPREDUCE-7349:
-----------------------------
    Description: 
Related cluster configuration:
 * MAX_FETCH_FAILURES_NOTIFICATIONS is 3
 * NodeManager recovery is disabled

Bug scenario:
 # submit a wordcount job with 2 simple map tasks ({{map_0}} and 
{{map_1}}) and 1 simple reduce task ({{reduce_0}});
 # all map tasks finish successfully and the AppMaster is notified;
 # the NodeManager that ran the map task {{map_1}} crashes;
 # the AppMaster schedules a reduce attempt;
 # the reduce attempt sends a {{statusUpdate}} message to the AppMaster to 
report a fetch failure;
 # the reduce attempt fails with {{Shuffle$ShuffleError}}, caused by 
{{java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out}};
 # the reduce attempt sends a {{fatalError}} message to the AppMaster;
 # the AppMaster successively reschedules another three reduce attempts, but 
all of them fail with {{Shuffle$ShuffleError}};
 # the AppMaster fails the wordcount job because of the failed reduce task;
 # the AppMaster receives three {{statusUpdate}} messages that report a fetch 
failure, like the message in step 5, but it has already failed the job and 
does not rerun the task {{map_1}}.
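
The ordering problem in the steps above can be sketched as a small, 
self-contained Java simulation. This is a toy model and not Hadoop code: 
{{ToyAppMaster}}, {{simulate}}, {{FETCH_FAILURE_THRESHOLD}} and 
{{MAX_REDUCE_ATTEMPTS}} are hypothetical names, the threshold of 3 is taken 
from the configuration above, and the limit of 4 reduce attempts is assumed 
from the scenario (one attempt plus three rescheduled ones). It also assumes 
that a reduce attempt succeeds once the map has been rerun.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class DelayedFetchFailureSketch {
    // Assumed values: the threshold mirrors MAX_FETCH_FAILURES_NOTIFICATIONS = 3
    // from the report; the attempt limit mirrors the 1 + 3 attempts in the scenario.
    static final int FETCH_FAILURE_THRESHOLD = 3;
    static final int MAX_REDUCE_ATTEMPTS = 4;

    enum JobState { RUNNING, SUCCEEDED, FAILED }

    // Hypothetical stand-in for the AppMaster; it only counts messages.
    static class ToyAppMaster {
        JobState state = JobState.RUNNING;
        int fetchFailureReports = 0;
        int failedReduceAttempts = 0;
        int ignoredReports = 0;
        boolean mapRerun = false;

        // Models a statusUpdate that carries a fetch failure for map_1.
        void onFetchFailureReport() {
            if (state != JobState.RUNNING) {
                ignoredReports++; // the job is already finished, so the report has no effect
                return;
            }
            fetchFailureReports++;
            if (fetchFailureReports >= FETCH_FAILURE_THRESHOLD) {
                mapRerun = true; // enough reports: rerun map_1
            }
        }

        // Models a fatalError sent by a failed reduce attempt.
        void onFatalError() {
            if (state != JobState.RUNNING) {
                return;
            }
            failedReduceAttempts++;
            if (failedReduceAttempts >= MAX_REDUCE_ATTEMPTS) {
                state = JobState.FAILED;
            }
        }

        void onReduceSuccess() {
            if (state == JobState.RUNNING) {
                state = JobState.SUCCEEDED;
            }
        }
    }

    // Runs the scenario. When delayLaterReports is true, the fetch-failure
    // reports of attempts 2..4 are delivered only after all fatalError messages.
    static ToyAppMaster simulate(boolean delayLaterReports) {
        ToyAppMaster am = new ToyAppMaster();
        Queue<Runnable> delayed = new ArrayDeque<>();
        for (int attempt = 1;
             attempt <= MAX_REDUCE_ATTEMPTS && am.state == JobState.RUNNING;
             attempt++) {
            if (am.mapRerun) {
                am.onReduceSuccess(); // assumption: the map output is available again
                break;
            }
            if (attempt == 1 || !delayLaterReports) {
                am.onFetchFailureReport();
            } else {
                delayed.add(am::onFetchFailureReport);
            }
            am.onFatalError();
        }
        while (!delayed.isEmpty()) {
            delayed.poll().run(); // late reports arrive after the job state is decided
        }
        return am;
    }

    public static void main(String[] args) {
        ToyAppMaster timely = simulate(false);
        ToyAppMaster late = simulate(true);
        System.out.println("timely reports:  " + timely.state
                + ", map rerun = " + timely.mapRerun);
        System.out.println("delayed reports: " + late.state
                + ", map rerun = " + late.mapRerun
                + ", ignored reports = " + late.ignoredReports);
    }
}
```

Under these assumptions, the run with timely reports reaches the threshold 
during the third attempt, reruns the map, and ends in {{SUCCEEDED}}. The run 
with delayed reports ends in {{FAILED}} with three ignored reports and no map 
rerun, which matches steps 9 and 10.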
  
  

> An unexpected node crash and delayed messages would fail the job
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-7349
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7349
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 3.2.2
>            Reporter: gaoyu
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

