Another good question. The systems with the nfsroot os still have a
local disk. That local disk has a /var partition where logs are written.
Both systems do send some logs to a remote log server. While the
/etc/rsyslog.conf files were almost identical, I copied the one from the
nfsroot system to the local-os system to make sure they were identical.
This has had no impact on the performance of xhpl.
Prentice
On 09/13/2017 02:16 PM, Scott Atchley wrote:
Are you logging something that goes to the disk in the local case, but
that is competing for network bandwidth when NFS mounting?
On Wed, Sep 13, 2017 at 2:15 PM, Scott Atchley
<e.scott.atch...@gmail.com> wrote:
Are you swapping?
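One way to check for swapping is to read the cumulative pswpin and pswpout counters in /proc/vmstat before and after an xhpl run; if they grow, pages are being swapped. A minimal Python sketch, in which the helper names are hypothetical and the parsing assumes the usual "name value" layout of /proc/vmstat:

```python
def swap_counters(vmstat_text):
    """Return (pswpin, pswpout) parsed from the text of /proc/vmstat."""
    counters = {}
    for line in vmstat_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in ("pswpin", "pswpout"):
            counters[fields[0]] = int(fields[1])
    # Missing counters are treated as zero.
    return counters.get("pswpin", 0), counters.get("pswpout", 0)


def swapped_between(before_text, after_text):
    """Return pages swapped in and out between two /proc/vmstat readings."""
    in0, out0 = swap_counters(before_text)
    in1, out1 = swap_counters(after_text)
    return in1 - in0, out1 - out0
```

Reading /proc/vmstat once before and once after the run and passing both texts to swapped_between() gives the number of pages swapped during the run; (0, 0) would suggest swapping is not the cause.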
On Wed, Sep 13, 2017 at 2:14 PM, Andrew Latham <lath...@gmail.com>
wrote:
ack, so maybe validate you can reproduce with another nfs
root. Maybe a lab setup where a single server is serving nfs
root to the node. If you could reproduce in that way then it
would give some direction. Beyond that it sounds like an
interesting problem.
On Wed, Sep 13, 2017 at 12:48 PM, Prentice Bisbal
<pbis...@pppl.gov> wrote:
Okay, based on the various responses I've gotten here and
on other lists, I feel I need to clarify things:
This problem only occurs when I'm running our NFSroot
based version of the OS (CentOS 6). When I run the same OS
installed on a local disk, I do not have this problem,
using the same exact server(s). For testing purposes, I'm
using LINPACK, and running the same executable with the
same HPL.dat file in both instances.
Because I'm testing the same hardware using different
OS installs, this should rule out the BIOS and faulty
hardware as the cause. This leads me to believe it's
most likely a software configuration issue, such as a
kernel tuning parameter or some other setting.
These are Supermicro servers, and it seems they do not
provide CPU temps. I do see a chassis temp, but not the
temps of the individual CPUs. While I agree that should be
the first thing I look at, it's not an option for me.
Other tools like FLIR and Infrared thermometers aren't
really an option for me, either.
What software configuration, either a kernel parameter,
configuration of numad or cpuspeed, or some other setting,
could affect this?
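One way to narrow down kernel parameters is to dump them with `sysctl -a` on both the NFSroot and the local-disk install and compare the two dumps. A minimal Python sketch of that comparison follows; the function names are hypothetical, the parsing assumes the usual "key = value" output of `sysctl -a`, and the sample values in the demo are made up purely for illustration:

```python
def parse_sysctl(text):
    """Parse 'key = value' lines, as printed by `sysctl -a`, into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '='
            settings[key.strip()] = value.strip()
    return settings


def diff_settings(nfsroot, local):
    """Return {key: (nfsroot_value, local_value)} for every differing key.

    A key present on only one side shows None for the other side.
    """
    diffs = {}
    for key in sorted(set(nfsroot) | set(local)):
        a, b = nfsroot.get(key), local.get(key)
        if a != b:
            diffs[key] = (a, b)
    return diffs


# Illustrative values only; real input would be the saved sysctl dumps.
nfsroot_dump = "vm.swappiness = 60\nvm.zone_reclaim_mode = 1\n"
local_dump = "vm.swappiness = 60\nvm.zone_reclaim_mode = 0\n"
for key, (a, b) in diff_settings(parse_sysctl(nfsroot_dump),
                                 parse_sysctl(local_dump)).items():
    print("%s: nfsroot=%s local=%s" % (key, a, b))
```

Any parameter that shows up in the diff is a candidate for the performance difference; identical dumps would point away from sysctl tuning and toward services such as numad or cpuspeed.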
Prentice
On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
Beowulfers,
I need your assistance debugging a problem:
I have a dozen servers that are all identical
hardware: SuperMicro servers with AMD Opteron 6320
processors. Ever since we upgraded to CentOS 6, the
users have been complaining of wildly inconsistent
performance across these 12 nodes. I ran LINPACK on
these nodes, and was able to duplicate the problem,
with performance varying from ~14 GFLOPS to 64 GFLOPS.
I've identified that performance on the slower nodes
starts off fine, and then slowly degrades throughout
the LINPACK run. For example, on a node with this
problem, during the first LINPACK test, I can see the
performance drop from 115 GFLOPS down to 11.3 GFLOPS.
That constant, downward trend continues throughout the
remaining tests. At the start of subsequent tests,
performance will jump up to about 9-10 GFLOPS, but
then drop to 5-6 GFLOPS at the end of the test.
Because of the nature of this problem, I suspect this
might be a thermal issue. My guess is that the
processor speed is being throttled to prevent
overheating on the "bad" nodes.
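Without CPU temperature sensors, one indirect check for throttling is to sample the per-core clock speed that Linux reports in /proc/cpuinfo while LINPACK runs. The sketch below is a minimal Python example under that assumption; the function names are hypothetical, and the "cpu MHz" field reflects frequency scaling as seen by the kernel, so it may not capture every throttling mechanism:

```python
import time


def parse_cpu_mhz(cpuinfo_text):
    """Return the 'cpu MHz' value of each logical CPU in /proc/cpuinfo text."""
    speeds = []
    for line in cpuinfo_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == "cpu MHz":
            speeds.append(float(value))
    return speeds


def watch(samples=10, interval=5.0, path="/proc/cpuinfo"):
    """Print min/mean/max clock speed every `interval` seconds, `samples` times."""
    for _ in range(samples):
        with open(path) as f:
            speeds = parse_cpu_mhz(f.read())
        if speeds:
            print("%s  min %.0f  mean %.0f  max %.0f MHz" % (
                time.strftime("%H:%M:%S"), min(speeds),
                sum(speeds) / len(speeds), max(speeds)))
        time.sleep(interval)
```

Calling watch() in a second shell on a "bad" node during an xhpl run would show whether the reported clock speed falls as the GFLOPS figure degrades; a steady clock speed would argue against frequency throttling.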
But here's the thing: this wasn't a problem until we
upgraded to CentOS 6. Where I work, we use a read-only
NFSroot filesystem for our cluster nodes, so all nodes
are mounting and using the same exact read-only image
of the operating system. This only happens with these
SuperMicro nodes, and only with the CentOS 6 on
NFSroot. RHEL5 on NFSroot worked fine, and when I
installed CentOS 6 on a local disk, the nodes worked fine.
Any ideas where to look or what to tweak to fix this?
Any idea why this is only occurring with RHEL 6 w/ NFS
root OS?
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
--
- Andrew "lathama" Latham lath...@gmail.com
<mailto:lath...@gmail.com> http://lathama.com
<http://lathama.org> -