On 09/08/2017 02:41 PM, Prentice Bisbal wrote:

But here's the thing: this wasn't a problem until we upgraded to CentOS 6. Where I work, we use a read-only NFSroot filesystem for our cluster nodes, so all nodes are mounting and using the same exact read-only image of the operating system. This only happens with these SuperMicro nodes, and only with the CentOS 6 on NFSroot. RHEL5 on NFSroot worked fine, and when I installed CentOS 6 on a local disk, the nodes worked fine.

Any ideas where to look or what to tweak to fix this? Any idea why this is only occurring with RHEL 6 w/ NFS root OS?


This sounds suspiciously like a network or other driver running hard in a tight polling loop, driving the number of context switches (CSW) and interrupts (Ints) up over time. Since these are Opterons (really? still in use?), chances are you have a firmware issue on the set of slower nodes that has already been corrected on the other nodes. With NFS root, if one node holds a lock on a file that other nodes want to write to, a waiting node can appear slow while it waits on the IO.

You might try running dstat from boot onwards and saving its output to a file. Then run the tests and see whether the interrupt (int) or context-switch (csw) rates are being driven very high. Pay attention to the usr/idl and other CPU percentages as well.
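
If dstat is not installed on the NFSroot image, the same two counters can be sampled straight from /proc/stat. The following is only a rough sketch, not a replacement for dstat: the function names, the 5-second interval, and the log path are assumptions made for illustration, and the interval must be a whole number of seconds.

```shell
#!/bin/sh
# Print the cumulative interrupt and context-switch totals from a
# /proc/stat-style file (defaults to /proc/stat) as "ints csw".
read_counters() {
    awk '$1 == "intr" { i = $2 } $1 == "ctxt" { c = $2 } END { print i, c }' "${1:-/proc/stat}"
}

# Print "epoch ints_per_sec csw_per_sec" once per interval, count times.
# Both defaults (5 seconds, 12 samples) are arbitrary illustration values.
sample_rates() {
    interval=${1:-5}
    count=${2:-12}
    prev=$(read_counters)
    n=0
    while [ "$n" -lt "$count" ]; do
        sleep "$interval"
        cur=$(read_counters)
        prev_i=${prev% *}; prev_c=${prev#* }
        cur_i=${cur% *};   cur_c=${cur#* }
        echo "$(date +%s) $(( (cur_i - prev_i) / interval )) $(( (cur_c - prev_c) / interval ))"
        prev=$cur
        n=$((n + 1))
    done
}
```

Something like `sample_rates 5 720 >> /tmp/intr-csw.log &` would then record an hour of samples in the background (the log path is hypothetical; on a read-only root it has to point at a writable location). Where dstat is available, `dstat -tcy --output /tmp/dstat.csv 5` should capture the same int/csw columns along with the CPU percentages.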

You can also grab temperature stats. It helps if you have IPMI:

    ipmitool sdr

or, to show only the temperature sensors:

    ipmitool sdr | grep Temp
    CPU1 Temp        | 35 degrees C      | ok
    CPU2 Temp        | 35 degrees C      | ok
    System Temp      | 35 degrees C      | ok
    Peripheral Temp  | 38 degrees C      | ok
    PCH Temp         | 43 degrees C      | ok

If you don't have IPMI, use sensors (from lm_sensors):

    sensors
    Package id 1:  +35.0°C  (high = +82.0°C, crit = +92.0°C)
    Core 0:        +35.0°C  (high = +82.0°C, crit = +92.0°C)
    Core 1:        +35.0°C  (high = +82.0°C, crit = +92.0°C)
    Core 2:        +33.0°C  (high = +82.0°C, crit = +92.0°C)
    Core 3:        +34.0°C  (high = +82.0°C, crit = +92.0°C)
    ...
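
To compare the slow nodes against the good ones, it can help to reduce the ipmitool output to just the sensors running warm. This is a rough sketch that assumes the pipe-separated `ipmitool sdr` layout shown above; the function name `hot_sensors` and the 70 degree default are made up for illustration and are not thresholds from any vendor.

```shell
#!/bin/sh
# Read `ipmitool sdr` output on stdin and print "name reading" for every
# temperature sensor at or above the given limit in degrees C.
# The default limit of 70 is an arbitrary illustration value.
hot_sensors() {
    awk -F'|' -v limit="${1:-70}" '
        $2 ~ /degrees C/ {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim padding around the name
            split($2, f, " ")                   # f[1] is the numeric reading
            if (f[1] + 0 >= limit) print name, f[1]
        }'
}
```

Usage would be along the lines of `ipmitool sdr | hot_sensors 70`, run on each node while the tests are going.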



--
Joe Landman
e: joe.land...@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
