acedia28 opened a new pull request #3324:
URL: https://github.com/apache/hadoop/pull/3324
### Description of PR
On our cluster with a large number of NMs, the preemption monitor thread
consistently hit java.util.ConcurrentModificationException when specific
conditions were met (and, as a result, preemption did not work).
The conditions we found are as follows (all four must be met):
1. There are at least two non-exclusive partitions besides the default partition
(call them partitions X and Y).
2. app1, in a queue belonging to the default partition (call it the
'dev' queue), has borrowed resources from both partitions X and Y.
3. app2 and app3, submitted to queues belonging to partitions X and Y
respectively, are 'PENDING' because the resources are consumed by app1.
4. The preemption monitor can clear the borrowed resources from X or Y when a
container of app1 is preempted.
The main problem is that FifoCandidatesSelector.selectCandidates tries to remove
a HashMap key (the partition name) while iterating over that HashMap.
Logically this is correct, because the same partition is not traversed again
within this 'selectCandidates' call. However, HashMap does not allow
structural modification while it is being iterated.
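
For illustration, here is a minimal standalone sketch of the failure mode and two common ways around it. It is not the code in FifoCandidatesSelector and not necessarily what the attached patch does; the class name, the map contents, and the `threshold` parameter are all hypothetical.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class PartitionRemovalSketch {

    // Hypothetical stand-in for a per-partition map (partition name -> some amount).
    static Map<String, Integer> samplePartitions() {
        Map<String, Integer> partitions = new HashMap<>();
        partitions.put("", 4);   // default partition
        partitions.put("X", 2);
        partitions.put("Y", 3);
        return partitions;
    }

    // Unsafe: removes keys from the map while iterating over its key set.
    // The first removal changes the map's modification count, so the
    // iterator's next step fails fast.
    static boolean unsafeRemovalThrows() {
        Map<String, Integer> partitions = samplePartitions();
        try {
            for (String partition : partitions.keySet()) {
                partitions.remove(partition);
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Safe option 1: remove through the iterator itself.
    static Map<String, Integer> removeViaIterator(int threshold) {
        Map<String, Integer> partitions = samplePartitions();
        Iterator<Map.Entry<String, Integer>> it = partitions.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() <= threshold) {
                it.remove();
            }
        }
        return partitions;
    }

    // Safe option 2: iterate over a snapshot of the keys and modify the real map.
    static Map<String, Integer> removeViaSnapshot(int threshold) {
        Map<String, Integer> partitions = samplePartitions();
        for (String partition : new ArrayList<>(partitions.keySet())) {
            if (partitions.get(partition) <= threshold) {
                partitions.remove(partition);
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        System.out.println("unsafe removal throws: " + unsafeRemovalThrows());
        System.out.println("iterator removal leaves: " + removeViaIterator(3).keySet());
        System.out.println("snapshot removal leaves: " + removeViaSnapshot(3).keySet());
    }
}
```

The snapshot variant keeps the loop body unchanged at the cost of one extra copy of the key set, which is usually negligible for a handful of partitions.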
I added a test case that reproduces the error
(testResourceTypesInterQueuePreemptionWithThreePartitions).
We found the problem and patched our cluster on 3.1.2, but trunk still seems to
have the same problem.
The attached patch is based on trunk.
Apache JIRA URL: https://issues.apache.org/jira/browse/YARN-10892
Thanks!
>> {{2020-09-07 12:20:37,105 ERROR monitor.SchedulingMonitor (SchedulingMonitor.java:run(116)) - Exception raised while executing preemption checker, skip this run..., exception=
java.util.ConcurrentModificationException
    at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
    at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
    at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoCandidatesSelector.selectCandidates(FifoCandidatesSelector.java:105)
    at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:489)
    at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:320)
    at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99)
    at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)}}
### How was this patch tested?
I added a new test case that reproduces the problem.
The new test case fails without this patch.
### For code changes:
- [x] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?