Re: [Beowulf] Varying performance across identical cluster nodes.

Prentice Bisbal Thu, 14 Sep 2017 06:27:03 -0700

Switching away from NFS root is not something I can change right now.


Prentice

On 09/13/2017 02:45 PM, Joe Landman wrote:

FWIW: I gave up on NFS boot a while ago, due in part to performance problems that were hard to track down. The environment I created to do complete ramboot boots at scale allows me to pivot to NFS if desired (boot-time switch). But I rarely use that. Pure ramboot has been a joy to work with as compared to NFS.
On 09/13/2017 01:48 PM, Prentice Bisbal wrote:
Okay, based on the various responses I've gotten here and on other lists, I feel I need to clarify things:
This problem only occurs when I'm running our NFSroot-based version of the OS (CentOS 6). When I run the same OS installed on a local disk, I do not have this problem, using the same exact server(s). For testing purposes, I'm using LINPACK, and running the same executable with the same HPL.dat file in both instances.
Because I'm testing the same hardware using different OSes, this (should) rule out the BIOS and faulty hardware as the cause. This leads me to believe it's most likely a software configuration issue, like a kernel tuning parameter or some other software setting.
These are Supermicro servers, and it seems they do not provide CPU temps. I do see a chassis temp, but not the temps of the individual CPUs. While I agree that should be the first thing I look at, it's not an option for me. Other tools like FLIR and infrared thermometers aren't really an option for me, either.
What software configuration, either a kernel parameter, configuration of numad or cpuspeed, or some other setting, could affect this?
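One way to hunt for such a difference (a sketch, not something from the original thread): dump the kernel parameters with `sysctl -a` on the same node booted once from NFSroot and once from local disk, then compare the two dumps. The helper names `parse_sysctl`, `diff_settings`, and `diff_files`, and the dump file names in the usage note, are hypothetical.

```python
def parse_sysctl(text):
    """Parse `sysctl -a` style output ("key = value" per line) into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no "="
            settings[key.strip()] = value.strip()
    return settings


def diff_settings(a_text, b_text):
    """Return {key: (value_a, value_b)} for every key whose value differs.

    A key present in only one dump shows up with None on the other side.
    """
    a, b = parse_sysctl(a_text), parse_sysctl(b_text)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


def diff_files(path_a, path_b):
    """Print the differing settings between two saved `sysctl -a` dumps."""
    with open(path_a) as fa, open(path_b) as fb:
        for key, (va, vb) in diff_settings(fa.read(), fb.read()).items():
            print("%s: %r -> %r" % (key, va, vb))
```

For example, after saving `sysctl -a > nfsroot.txt` and `sysctl -a > localdisk.txt` on the same node, `diff_files("nfsroot.txt", "localdisk.txt")` lists only the parameters that differ. Some values (counters, random seeds) change on every boot and can be ignored.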
Prentice

On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
Beowulfers,

I need your assistance debugging a problem:
I have a dozen servers that are all identical hardware: SuperMicro servers with AMD Opteron 6320 processors. Ever since we upgraded to CentOS 6, the users have been complaining of wildly inconsistent performance across these 12 nodes. I ran LINPACK on these nodes, and was able to duplicate the problem, with performance varying from ~14 GFLOPS to 64 GFLOPS.
I've identified that performance on the slower nodes starts off fine, and then slowly degrades throughout the LINPACK run. For example, on a node with this problem, during the first LINPACK test, I can see the performance drop from 115 GFLOPS down to 11.3 GFLOPS. That constant, downward trend continues throughout the remaining tests. At the start of subsequent tests, performance will jump up to about 9-10 GFLOPS, but then drop to 5-6 GFLOPS at the end of the test.
Because of the nature of this problem, I suspect this might be a thermal issue. My guess is that the processor speed is being throttled to prevent overheating on the "bad" nodes.
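A rough way to check the throttling guess without temperature sensors (a sketch, not from the original post): sample the per-core clock speed from `/proc/cpuinfo` while LINPACK runs and watch whether it falls over time. The helper names `parse_cpu_mhz` and `monitor`, the sample count, and the 5-second interval are arbitrary choices. The `cpu MHz` field shows what the kernel reports and may not reflect every form of hardware throttling.

```python
import time


def parse_cpu_mhz(cpuinfo_text):
    """Extract the "cpu MHz" value of every logical CPU from /proc/cpuinfo text."""
    speeds = []
    for line in cpuinfo_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "cpu MHz":
            speeds.append(float(value))
    return speeds


def monitor(samples=3, interval=5, path="/proc/cpuinfo"):
    """Print min/avg/max reported clock speed `samples` times, `interval` seconds apart."""
    for i in range(samples):
        with open(path) as f:
            speeds = parse_cpu_mhz(f.read())
        if speeds:
            avg = sum(speeds) / len(speeds)
            print("%s min=%.0f avg=%.0f max=%.0f MHz"
                  % (time.strftime("%H:%M:%S"), min(speeds), avg, max(speeds)))
        if i < samples - 1:
            time.sleep(interval)
```

Calling `monitor(samples=720, interval=5)` in a second shell would cover roughly an hour of a LINPACK run. A steady decline in the reported speeds on a "bad" node, but not on a "good" one, would support the throttling theory.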
But here's the thing: this wasn't a problem until we upgraded to CentOS 6. Where I work, we use a read-only NFSroot filesystem for our cluster nodes, so all nodes are mounting and using the same exact read-only image of the operating system. This only happens with these SuperMicro nodes, and only with the CentOS 6 on NFSroot. RHEL5 on NFSroot worked fine, and when I installed CentOS 6 on a local disk, the nodes worked fine.
Any ideas where to look or what to tweak to fix this? Any idea why this is only occurring with RHEL 6 w/ NFS root OS?
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

