On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
> But here's the thing: this wasn't a problem until we upgraded to
> CentOS 6. Where I work, we use a read-only NFSroot filesystem for our
> cluster nodes, so all nodes are mounting and using the same exact
> read-only image of the operating system. This only happens with these
> SuperMicro nodes, and only with the CentOS 6 on NFSroot. RHEL5 on
> NFSroot worked fine, and when I installed CentOS 6 on a local disk,
> the nodes worked fine.
> Any ideas where to look or what to tweak to fix this? Any idea why
> this is only occurring with RHEL 6 w/ NFS root OS?
Sounds suspiciously like a network or other driver running hard in a
tight polling mode, causing a growing number of context switches (CSW)
and interrupts over time. Since these are Opterons (really? still in
use?), chances are you have a firmware issue on the set of slower nodes
that has already been corrected on the other nodes. With NFS root, if
one node is locking a particular file that the other nodes want to
write to, those nodes can appear slow while they wait on the IO.
You might try running dstat and saving its output to a file from boot
onwards. Then run the tests, and see whether the int or csw columns are
being driven very high. Pay attention to the usr/idl and other CPU
percentages as well.
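
If dstat is not in the NFSroot image, the same two numbers can be pulled straight from /proc/stat, which carries cumulative "intr" and "ctxt" totals. What follows is only a rough sketch of that idea, not a replacement for dstat: the function names, the 5 second interval, and the sample count are my own arbitrary choices.

```python
import time


def parse_proc_stat(text):
    """Return cumulative interrupt and context-switch totals from /proc/stat text."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # "intr <total> <per-IRQ counts...>" and "ctxt <total>"
        if fields[0] == "intr":
            counters["intr"] = int(fields[1])
        elif fields[0] == "ctxt":
            counters["ctxt"] = int(fields[1])
    return counters


def rates(before, after, seconds):
    """Convert two cumulative samples into per-second rates."""
    return {name: (after[name] - before[name]) / seconds for name in before}


def sample(path="/proc/stat"):
    """Read and parse the counters once."""
    with open(path) as handle:
        return parse_proc_stat(handle.read())


def log_rates(interval=5.0, samples=12):
    """Print int/s and csw/s every `interval` seconds (values are arbitrary)."""
    previous = sample()
    for _ in range(samples):
        time.sleep(interval)
        current = sample()
        r = rates(previous, current, interval)
        stamp = time.strftime("%H:%M:%S")
        print(f"{stamp} int/s={r['intr']:.0f} csw/s={r['ctxt']:.0f}", flush=True)
        previous = current
```

Calling log_rates() on a slow node and on a good node during the same test, with the output redirected to a file, should show whether the rates climb over time on one and not the other.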
You can also grab temperature stats. It helps if you have IPMI:

    ipmitool sdr
    ipmitool sdr | grep Temp
    CPU1 Temp | 35 degrees C | ok
    CPU2 Temp | 35 degrees C | ok
    System Temp | 35 degrees C | ok
    Peripheral Temp | 38 degrees C | ok
    PCH Temp | 43 degrees C | ok

If not, use sensors (from lm_sensors):

    sensors
    Package id 1: +35.0°C (high = +82.0°C, crit = +92.0°C)
    Core 0: +35.0°C (high = +82.0°C, crit = +92.0°C)
    Core 1: +35.0°C (high = +82.0°C, crit = +92.0°C)
    Core 2: +33.0°C (high = +82.0°C, crit = +92.0°C)
    Core 3: +34.0°C (high = +82.0°C, crit = +92.0°C)
    ...
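
To compare temperatures across many nodes, it may be easier to turn the ipmitool lines into numbers than to eyeball them. The sketch below assumes the "name | NN degrees C | status" layout shown in the ipmitool output above; other BMCs may format readings differently, and the function names are made up for illustration.

```python
def parse_sdr_temps(text):
    """Map sensor name -> (degrees C, status) for lines like
    'CPU1 Temp | 35 degrees C | ok'. Lines in any other shape are skipped."""
    temps = {}
    for line in text.splitlines():
        parts = [part.strip() for part in line.split("|")]
        if len(parts) != 3:
            continue
        name, reading, status = parts
        words = reading.split()
        if len(words) == 3 and words[1:] == ["degrees", "C"]:
            try:
                temps[name] = (float(words[0]), status)
            except ValueError:
                continue  # reading was not numeric
    return temps


def hottest(temps):
    """Return (sensor name, degrees C) for the highest reading."""
    name = max(temps, key=lambda key: temps[key][0])
    return name, temps[name][0]
```

The text argument could come from a saved file or from subprocess.run(["ipmitool", "sdr"], capture_output=True, text=True).stdout on the node itself.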
--
Joe Landman
e: joe.land...@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf