Hi Wu. If yarn.nodemanager.resource.memory-mb is greater than the amount of
memory on a specific node, the scheduler will assign more containers to that
node than probably should be running there. They will still run, but it will
cause a lot of disk swapping, which will slow down each task running on that
node.
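As a sketch of the fix (the value below is illustrative, not a recommendation): on each node, yarn.nodemanager.resource.memory-mb should be set below that node's physical RAM, leaving headroom for the OS, the NodeManager itself, and other daemons. For a 64 GB node that might look like:

```xml
<!-- yarn-site.xml on a 64 GB node; the value is an illustrative assumption -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <!-- ~56 GB for containers, leaving ~8 GB for the OS and daemons -->
  <value>57344</value>
</property>
```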
I don't know much about the FairScheduler's preemption, but if preemption is
aggressive, it could potentially kill more containers than are necessary, which
causes the app to lose work that has to be redone.
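If preemption does turn out to be too aggressive, the per-queue knobs in fair-scheduler.xml can be relaxed. As a sketch (values are illustrative): raising the preemption timeouts makes the scheduler wait longer before killing containers, and lowering fairSharePreemptionThreshold means a queue must fall further below its fair share before preemption kicks in at all:

```xml
<!-- fair-scheduler.xml, inside a <queue> element; values are illustrative -->
<!-- wait 120s below min/fair share before preempting, instead of 20/25s -->
<minSharePreemptionTimeout>120</minSharePreemptionTimeout>
<fairSharePreemptionTimeout>120</fairSharePreemptionTimeout>
<!-- only preempt when the queue's usage is below 50% of its fair share -->
<fairSharePreemptionThreshold>0.5</fairSharePreemptionThreshold>
```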
From: wuchang <[email protected]>
To: [email protected]
Cc: Chang. Wu <[email protected]>
Sent: Monday, May 22, 2017 11:32 PM
Subject: What if the configured node memory in yarn-site.xml is more than
node's physical memory?
My YARN cluster uses the FairScheduler with 4 queues; below is my queue
configuration:
<allocations>
  <queue name="highPriority">
    <minResources>100000 mb, 30 vcores</minResources>
    <maxResources>250000 mb, 100 vcores</maxResources>
  </queue>
  <queue name="default">
    <minResources>50000 mb, 20 vcores</minResources>
    <maxResources>100000 mb, 50 vcores</maxResources>
    <maxAMShare>-1.0f</maxAMShare>
  </queue>
  <queue name="ep">
    <minResources>100000 mb, 30 vcores</minResources>
    <maxResources>300000 mb, 100 vcores</maxResources>
    <maxAMShare>-1.0f</maxAMShare>
  </queue>
  <queue name="vip">
    <minResources>30000 mb, 20 vcores</minResources>
    <maxResources>60000 mb, 50 vcores</maxResources>
    <maxAMShare>-1.0f</maxAMShare>
  </queue>
  <fairSharePreemptionTimeout>300</fairSharePreemptionTimeout>
</allocations>
Obviously, I didn't configure any preemption, so everything ran fine except
that the total resource usage rate of my cluster was not very high.
So I decided to turn on preemption and modified fair-scheduler.xml as below:
<allocations>
  <queue name="highPriority">
    <minResources>100000 mb, 30 vcores</minResources>
    <maxResources>300000 mb, 100 vcores</maxResources>
    <weight>0.35</weight>
    <minSharePreemptionTimeout>20</minSharePreemptionTimeout>
    <fairSharePreemptionTimeout>25</fairSharePreemptionTimeout>
    <fairSharePreemptionThreshold>0.8</fairSharePreemptionThreshold>
    <maxAMShare>0.3f</maxAMShare>
    <maxRunningApps>18</maxRunningApps>
  </queue>
  <queue name="default">
    <minResources>50000 mb, 20 vcores</minResources>
    <maxResources>140000 mb, 70 vcores</maxResources>
    <weight>0.14</weight>
    <minSharePreemptionTimeout>20</minSharePreemptionTimeout>
    <fairSharePreemptionTimeout>25</fairSharePreemptionTimeout>
    <fairSharePreemptionThreshold>0.5</fairSharePreemptionThreshold>
    <maxAMShare>0.3f</maxAMShare>
    <maxRunningApps>20</maxRunningApps>
  </queue>
  <queue name="ep">
    <minResources>100000 mb, 30 vcores</minResources>
    <maxResources>600000 mb, 100 vcores</maxResources>
    <weight>0.42</weight>
    <minSharePreemptionTimeout>20</minSharePreemptionTimeout>
    <fairSharePreemptionTimeout>25</fairSharePreemptionTimeout>
    <fairSharePreemptionThreshold>0.8</fairSharePreemptionThreshold>
    <maxAMShare>0.3f</maxAMShare>
    <maxRunningApps>20</maxRunningApps>
  </queue>
  <queue name="vip">
    <minResources>6000 mb, 20 vcores</minResources>
    <maxResources>120000 mb, 30 vcores</maxResources>
    <weight>0.09</weight>
    <minSharePreemptionTimeout>20</minSharePreemptionTimeout>
    <fairSharePreemptionTimeout>25</fairSharePreemptionTimeout>
    <fairSharePreemptionThreshold>0.8</fairSharePreemptionThreshold>
    <maxAMShare>0.3f</maxAMShare>
    <maxRunningApps>10</maxRunningApps>
  </queue>
</allocations>
Yes, after preemption was turned on, the total resource usage rate of my
cluster went up to 90%+, but after one night (midnight is the busiest time for
my YARN cluster), I found that many applications were delayed.
After a long time of troubleshooting, I found that of the 9 machines in my
cluster, 5 have 128 GB of physical memory and the remaining 4 have 64 GB, but
in the yarn-site.xml of every machine, yarn.nodemanager.resource.memory-mb is
configured as 97280. That is to say, on those 4 machines
yarn.nodemanager.resource.memory-mb is actually larger than the physical
memory.
So I suspect this is what causes the phenomenon: even though the total cluster
resource usage has improved, each application takes more time to execute and
is seriously delayed.
Any suggestions?