Are you swapping?

On Wed, Sep 13, 2017 at 2:14 PM, Andrew Latham <lath...@gmail.com> wrote:
> ack, so maybe validate you can reproduce with another nfs root. Maybe a
> lab setup where a single server is serving nfs root to the node. If you
> could reproduce in that way then it would give some direction. Beyond that
> it sounds like an interesting problem.
>
> On Wed, Sep 13, 2017 at 12:48 PM, Prentice Bisbal <pbis...@pppl.gov>
> wrote:
>
>> Okay, based on the various responses I've gotten here and on other lists,
>> I feel I need to clarify things:
>>
>> This problem only occurs when I'm running our NFSroot based version of
>> the OS (CentOS 6). When I run the same OS installed on a local disk, I do
>> not have this problem, using the same exact server(s). For testing
>> purposes, I'm using LINPACK, and running the same executable with the same
>> HPL.dat file in both instances.
>>
>> Because I'm testing the same hardware using different OSes, this (should)
>> eliminate the problem being in the BIOS, and faulty hardware. This leads me
>> to believe it's most likely a software configuration issue, like a kernel
>> tuning parameter, or some other software configuration issue.
>>
>> These are Supermicro servers, and it seems they do not provide CPU temps.
>> I do see a chassis temp, but not the temps of the individual CPUs. While I
>> agree that should be the first thing I look at, it's not an option for me.
>> Other tools like FLIR and infrared thermometers aren't really an option for
>> me, either.
>>
>> What software configuration, either a kernel parameter, configuration
>> of numad or cpuspeed, or some other setting, could affect this?
>>
>> Prentice
>>
>> On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
>>
>>> Beowulfers,
>>>
>>> I need your assistance debugging a problem:
>>>
>>> I have a dozen servers that are all identical hardware: SuperMicro
>>> servers with AMD Opteron 6320 processors. Ever since we upgraded to CentOS
>>> 6, the users have been complaining of wildly inconsistent performance
>>> across these 12 nodes. I ran LINPACK on these nodes, and was able to
>>> duplicate the problem, with performance varying from ~14 GFLOPS to 64
>>> GFLOPS.
>>>
>>> I've identified that performance on the slower nodes starts off fine,
>>> and then slowly degrades throughout the LINPACK run. For example, on a node
>>> with this problem, during the first LINPACK test, I can see the performance
>>> drop from 115 GFLOPS down to 11.3 GFLOPS. That constant, downward trend
>>> continues throughout the remaining tests. At the start of subsequent tests,
>>> performance will jump up to about 9-10 GFLOPS, but then drop to 5-6 GFLOPS
>>> at the end of the test.
>>>
>>> Because of the nature of this problem, I suspect this might be a thermal
>>> issue. My guess is that the processor speed is being throttled to prevent
>>> overheating on the "bad" nodes.
>>>
>>> But here's the thing: this wasn't a problem until we upgraded to CentOS
>>> 6. Where I work, we use a read-only NFSroot filesystem for our cluster
>>> nodes, so all nodes are mounting and using the same exact read-only image
>>> of the operating system. This only happens with these SuperMicro nodes, and
>>> only with the CentOS 6 on NFSroot. RHEL5 on NFSroot worked fine, and when I
>>> installed CentOS 6 on a local disk, the nodes worked fine.
>>>
>>> Any ideas where to look or what to tweak to fix this? Any idea why this
>>> is only occurring with RHEL 6 w/ NFS root OS?
>>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>
> --
> - Andrew "lathama" Latham lath...@gmail.com http://lathama.com
> <http://lathama.org> -
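
In case it helps answer the swapping question, here is a rough sketch (not something from this thread) that samples the pswpin/pswpout counters in /proc/vmstat while a LINPACK run is in progress. The function names, the 5-second interval and the sample count are all my own assumptions. If both deltas stay at zero for the whole run, swapping is probably not the cause; the si/so columns of "vmstat 1" report the same activity.

```python
import time


def read_swap_counters(text):
    """Parse (pswpin, pswpout) from the text of /proc/vmstat.

    Both are cumulative counts of pages swapped in/out since boot.
    A counter that is missing from the text is reported as 0.
    """
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters.get("pswpin", 0), counters.get("pswpout", 0)


def swap_delta(before, after):
    """Pages swapped (in, out) between two read_swap_counters() samples."""
    return after[0] - before[0], after[1] - before[1]


def watch(interval=5.0, samples=12, path="/proc/vmstat"):
    """Print swap activity per interval. The defaults are arbitrary choices."""
    with open(path) as f:
        prev = read_swap_counters(f.read())
    for _ in range(samples):
        time.sleep(interval)
        with open(path) as f:
            cur = read_swap_counters(f.read())
        pages_in, pages_out = swap_delta(prev, cur)
        print("pages swapped in: %d  out: %d" % (pages_in, pages_out))
        prev = cur
```

Call watch() on one of the slow nodes while HPL is running and compare the output against the same run on the local-disk install.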