I would also suspect a thermal issue, though as Andrew mentioned, it could also be firmware or CPU microcode. To check for a temperature problem, you could set up lm_sensors or scrape "ipmitool sdr" output (whichever is easier) at regular intervals and make a performance-vs-temperature plot for each node. We recently tracked down a problem on some of our nodes that turned out to be microcode-related: the CPUs would start in a high-power state but then get stuck in a low-power state, regardless of the power management settings we had set in the BIOS.
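For what it's worth, here is a rough Python sketch of the scraping approach. It assumes "ipmitool sdr" prints rows of the form "name | reading | status" with temperatures written like "45 degrees C", and that /proc/cpuinfo exposes "cpu MHz" lines; sensor names differ between boards. The function names and the CSV layout are made up for illustration, so treat it as a starting point rather than a finished tool.

```python
import csv
import re
import subprocess
import sys
import time

# Matches a reading such as "45 degrees C" (assumed `ipmitool sdr` format).
TEMP_RE = re.compile(r"^(-?\d+(?:\.\d+)?)\s+degrees\s+C$")


def parse_sdr_temps(text):
    """Return {sensor_name: degrees_C} for the temperature rows in `ipmitool sdr` output."""
    temps = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 2:
            continue  # blank or malformed row
        match = TEMP_RE.match(fields[1])
        if match:
            temps[fields[0]] = float(match.group(1))
    return temps


def read_cpu_mhz(path="/proc/cpuinfo"):
    """Return the per-core 'cpu MHz' values reported by the kernel."""
    with open(path) as fh:
        return [float(line.split(":")[1]) for line in fh if line.startswith("cpu MHz")]


def log_samples(interval_s=30, count=None, out=sys.stdout):
    """Write CSV rows (epoch, sensor, temp_C, mean_cpu_MHz) every interval_s seconds.

    Hypothetical helper: runs forever when count is None.
    """
    writer = csv.writer(out)
    taken = 0
    while count is None or taken < count:
        sdr = subprocess.run(
            ["ipmitool", "sdr"], capture_output=True, text=True, check=True
        ).stdout
        mhz = read_cpu_mhz()
        mean_mhz = round(sum(mhz) / len(mhz), 1) if mhz else ""
        now = int(time.time())
        for name, temp in sorted(parse_sdr_temps(sdr).items()):
            writer.writerow([now, name, temp, mean_mhz])
        out.flush()
        taken += 1
        time.sleep(interval_s)
```

Running log_samples() on each node alongside LINPACK and lining up the timestamps with the benchmark output should give you the data for the plot. Logging the clock speed next to the temperature may also help tell thermal throttling apart from a CPU that is stuck in a low-power state while running cool.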
Skylar

On Fri, Sep 8, 2017 at 7:41 PM, Prentice Bisbal <pbis...@pppl.gov> wrote:

> Beowulfers,
>
> I need your assistance debugging a problem:
>
> I have a dozen servers that are all identical hardware: SuperMicro servers
> with AMD Opteron 6320 processors. Ever since we upgraded to CentOS 6, the
> users have been complaining of wildly inconsistent performance across these
> 12 nodes. I ran LINPACK on these nodes, and was able to duplicate the
> problem, with performance varying from ~14 GFLOPS to 64 GFLOPS.
>
> I've identified that performance on the slower nodes starts off fine, and
> then slowly degrades throughout the LINPACK run. For example, on a node
> with this problem, during the first LINPACK test, I can see the performance
> drop from 115 GFLOPS down to 11.3 GFLOPS. That constant, downward trend
> continues throughout the remaining tests. At the start of subsequent tests,
> performance will jump up to about 9-10 GFLOPS, but then drop to 5-6 GFLOPS
> at the end of the test.
>
> Because of the nature of this problem, I suspect this might be a thermal
> issue. My guess is that the processor speed is being throttled to prevent
> overheating on the "bad" nodes.
>
> But here's the thing: this wasn't a problem until we upgraded to CentOS 6.
> Where I work, we use a read-only NFSroot filesystem for our cluster nodes,
> so all nodes are mounting and using the same exact read-only image of the
> operating system. This only happens with these SuperMicro nodes, and only
> with CentOS 6 on NFSroot. RHEL5 on NFSroot worked fine, and when I
> installed CentOS 6 on a local disk, the nodes worked fine.
>
> Any ideas where to look or what to tweak to fix this? Any idea why this is
> only occurring with RHEL 6 w/ NFS root OS?
>
> --
> Prentice
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf