Beowulfers,
I'm happy to announce that I finally found the cause of this problem:
numad. On these particular systems, numad was having a catastrophic
effect on performance. As the jobs ran, GFLOPS would decrease steadily
and monotonically. Watching the output of turbostat and 'cpupower
monitor', I could see more and more cores becoming idle as the job
ran. As soon as I turned off numad and restarted my LINPACK jobs,
performance went back up, and this time it stayed there for the
duration of the job.
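In case anyone wants to check for the same behavior, here's roughly
what I did (CentOS 6 SysV init commands):

    # Watch per-core activity/idle states while the job runs;
    # on the bad nodes, more and more cores went idle over time:
    cpupower monitor
    turbostat

    # Stop numad now and keep it from starting at boot:
    service numad stop
    chkconfig numad off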
To make sure I wasn't completely crazy for having numad enabled on these
systems, I did a Google search and came across the paper below, which
indicates that numad is helpful in some cases but not in others:
http://iopscience.iop.org/article/10.1088/1742-6596/664/9/092010/pdf
To verify this fix, I ran LINPACK again across all the nodes in this
cluster (well, all the nodes that weren't running user jobs at the
time), in addition to the Supermicro nodes. I found that on the
non-Supermicro nodes, which are ProLiant servers with different Opteron
processors, turning numad off actually decreased performance by about 5%.
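(If you want to check numad's state across a whole cluster quickly,
something like this works, assuming you have pdsh; the node names
here are made up, substitute whatever parallel shell you use:

    pdsh -w node[01-12] 'service numad status' | dshbak -c
)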
Have any of you had similar problems with numad? Do you leave it on or
off on your cluster nodes? Feedback is greatly appreciated. I did a
Google search for 'Linux numad HPC performance' (or something like
that), and the link above was all I could find on this topic.
For now, I think I'm going to leave numad enabled on the non-Supermicro
nodes until I can do more research/testing.
Prentice
On 09/13/2017 01:48 PM, Prentice Bisbal wrote:
Okay, based on the various responses I've gotten here and on other
lists, I feel I need to clarify things:
This problem only occurs when I'm running our NFSroot-based version of
the OS (CentOS 6). When I run the same OS installed on a local disk, I
do not have this problem on the exact same server(s). For testing
purposes, I'm using LINPACK, running the same executable with the
same HPL.dat file in both instances.
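For reference, both runs were launched the same way, something along
these lines (the process count and path are placeholders, not our
actual values):

    cd /path/to/hpl/bin   # HPL.dat is read from the current directory
    mpirun -np 16 ./xhpl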
Because I'm testing the same hardware with two different OS
deployments, this should eliminate the BIOS and faulty hardware as
causes. That leads me to believe it's most likely a software
configuration issue, like a kernel tuning parameter or some other
setting.
These are Supermicro servers, and it seems they do not provide CPU
temps. I do see a chassis temp, but not the temps of the individual
CPUs. While I agree that should be the first thing I look at, it's not
an option for me. Other tools like FLIR cameras and infrared
thermometers aren't really an option for me, either.
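(For what it's worth, you can list whatever temperature sensors the
BMC does expose with ipmitool, assuming IPMI access is set up:

    ipmitool sdr type Temperature

On these boards that appears to be just the chassis temp.)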
What software configuration, whether a kernel parameter, the
configuration of numad or cpuspeed, or some other setting, could
affect this?
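For concreteness, these are the kinds of settings I mean, and how they
can be compared between the NFSroot image and the local-disk install
(standard CentOS 6 commands; a sketch, not an exhaustive list):

    # CPU frequency scaling: service state and current governor
    service cpuspeed status
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

    # Which of the usual suspects are enabled on each image
    chkconfig --list | grep -E 'numad|cpuspeed'

    # Kernel tunables: capture on each system, then diff the files
    sysctl -a | sort > /tmp/sysctl-$(hostname).txt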
Prentice
On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
Beowulfers,
I need your assistance debugging a problem:
I have a dozen servers with identical hardware: Supermicro
servers with AMD Opteron 6320 processors. Ever since we upgraded to
CentOS 6, the users have been complaining of wildly inconsistent
performance across these 12 nodes. I ran LINPACK on these nodes and
was able to duplicate the problem, with performance varying from ~14
GFLOPS to 64 GFLOPS.
I've identified that performance on the slower nodes starts off fine
and then slowly degrades throughout the LINPACK run. For example, on
a node with this problem, during the first LINPACK test I can see the
performance drop from 115 GFLOPS down to 11.3 GFLOPS. That constant
downward trend continues throughout the remaining tests. At the start
of subsequent tests, performance will jump back up to about 9-10
GFLOPS, but then drop to 5-6 GFLOPS by the end of the test.
Because of the nature of this problem, I suspect this might be a
thermal issue. My guess is that the processor speed is being
throttled to prevent overheating on the "bad" nodes.
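One way to test that guess without working temperature sensors is to
watch the reported core clocks fall while a run is in progress, e.g.:

    watch -n1 'grep "cpu MHz" /proc/cpuinfo'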
But here's the thing: this wasn't a problem until we upgraded to
CentOS 6. Where I work, we use a read-only NFSroot filesystem for our
cluster nodes, so all nodes mount and use the exact same read-only
image of the operating system. This only happens with these
Supermicro nodes, and only with CentOS 6 on NFSroot. RHEL 5 on
NFSroot worked fine, and when I installed CentOS 6 on a local disk,
the nodes worked fine.
Any ideas where to look or what to tweak to fix this? Any idea why
this is only occurring with CentOS 6 on an NFSroot OS?
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf