On 09/14/2017 09:25 AM, John Hearns via Beowulf wrote:
Prentice, as I understand it, the problem here is that with the same OS and IB drivers there is a big difference in performance between stateful and NFS root nodes. Throwing my hat into the ring: try looking to see if there is an excessive rate of interrupts in the nfsroot case, coming from the network card:

watch cat /proc/interrupts

You will probably need a large terminal window for this (or you can filter the output; see the example below)

dstat is helpful here.
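
For example, something along these lines (the "mlx4" match is an assumption; adjust it to however your IB HCA actually shows up in /proc/interrupts):

# show only the header row and the HCA's interrupt lines, refreshed every second
watch -n1 'grep -E "CPU|mlx4" /proc/interrupts'

# cpu, disk, net, paging and system stats, one line per second
dstat -cdngy 1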

On 14 September 2017 at 15:14, Prentice Bisbal <pbis...@pppl.gov> wrote:

    Good question. I just checked using vmstat. When running xhpl on
    both systems, vmstat shows only zeros for si and so, even long
    after the performance degrades on the nfsroot instance. Just to be
    sure, I double-checked with top, which shows 0k of swap being used.
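
    For reference, the checks look like this (the interval and
    count are arbitrary):

    # si/so columns are swap-in/swap-out per second; all zeros here
    vmstat 1 5

    # free/used swap in MB
    free -m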

    Prentice

    On 09/13/2017 02:15 PM, Scott Atchley wrote:
    Are you swapping?

    On Wed, Sep 13, 2017 at 2:14 PM, Andrew Latham <lath...@gmail.com> wrote:

        Ack. So maybe validate that you can reproduce this with
        another NFS root, e.g. a lab setup where a single server
        serves the NFS root to one node. If you can reproduce it
        that way, it would give some direction. Beyond that, it
        sounds like an interesting problem.
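
        A minimal sketch of such a lab setup (the paths and
        addresses here are made up; adjust to your environment):

        # /etc/exports on the lab server: read-only root image for one test node
        /export/nfsroot  192.168.1.10(ro,no_root_squash,async)

        # kernel command line for the test node (e.g. via PXE)
        root=/dev/nfs nfsroot=192.168.1.1:/export/nfsroot ro ip=dhcp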

        On Wed, Sep 13, 2017 at 12:48 PM, Prentice Bisbal <pbis...@pppl.gov> wrote:

            Okay, based on the various responses I've gotten here and
            on other lists, I feel I need to clarify things:

            This problem only occurs when I'm running our
            NFSroot-based version of the OS (CentOS 6). When I run
            the same OS installed on a local disk, I do not have
            this problem on the exact same server(s). For testing
            purposes, I'm using LINPACK, running the same
            executable with the same HPL.dat file in both
            instances.

            Because I'm testing the same hardware with different
            OS images, this (should) rule out the BIOS and faulty
            hardware, which leads me to believe it's most likely a
            software configuration issue, such as a kernel tuning
            parameter or some other setting.

            These are Supermicro servers, and it seems they do not
            provide CPU temps. I do see a chassis temp, but not
            the temps of the individual CPUs. While I agree that
            should be the first thing to look at, it's not an
            option for me, and neither are tools like a FLIR
            camera or an infrared thermometer.
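
            For reference, the usual probes look something like
            this (whether k10temp supports these Opterons is an
            assumption on my part):

            # sensors exposed by the BMC; on these boards, only a chassis temp
            ipmitool sdr type Temperature

            # on-die sensors via lm_sensors, if the driver supports the CPU
            modprobe k10temp && sensors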

            What software configuration, whether a kernel
            parameter, the configuration of numad or cpuspeed, or
            some other setting, could affect this?
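
            The obvious suspects can be checked like this (service
            names are the CentOS 6 ones):

            # current cpufreq governor on every core
            cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

            # are numad and cpuspeed running, and enabled at boot?
            service numad status
            service cpuspeed status
            chkconfig --list | grep -E 'numad|cpuspeed'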

            Prentice

            On 09/08/2017 02:41 PM, Prentice Bisbal wrote:

                Beowulfers,

                I need your assistance debugging a problem:

                I have a dozen servers that are all identical
                hardware: SuperMicro servers with AMD Opteron 6320
                processors. Ever since we upgraded to CentOS 6, the
                users have been complaining of wildly inconsistent
                performance across these 12 nodes. I ran LINPACK on
                these nodes, and was able to duplicate the problem,
                with performance varying from ~14 GFLOPS to 64 GFLOPS.

                I've identified that performance on the slower
                nodes starts off fine and then slowly degrades
                throughout the LINPACK run. For example, on a node
                with this problem, during the first LINPACK test I
                can see the performance drop from 115 GFLOPS down
                to 11.3 GFLOPS. That constant downward trend
                continues throughout the remaining tests. At the
                start of subsequent tests, performance will jump
                up to about 9-10 GFLOPS, but then drop to 5-6
                GFLOPS at the end of the test.

                Because of the nature of this problem, I suspect this
                might be a thermal issue. My guess is that the
                processor speed is being throttled to prevent
                overheating on the "bad" nodes.
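
                One way to test that theory is to watch the clocks
                while LINPACK runs, for example:

                # throttled cores sit well below their nominal clock
                watch -n1 "grep MHz /proc/cpuinfo"

                # cpufreq's view of the current frequency, if the driver is loaded
                cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq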

                But here's the thing: this wasn't a problem until we
                upgraded to CentOS 6. Where I work, we use a
                read-only NFSroot filesystem for our cluster nodes,
                so all nodes are mounting and using the same exact
                read-only image of the operating system. This only
                happens with these SuperMicro nodes, and only with
                CentOS 6 on NFSroot. RHEL 5 on NFSroot worked
                fine, and when I installed CentOS 6 on a local disk,
                the nodes worked fine.
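
                It may also be worth comparing the NFS mount
                options and client-side counters between the
                working and misbehaving images, e.g.:

                # mount options actually in effect for the root filesystem
                grep nfs /proc/mounts

                # client-side NFS operation counts and retransmissions
                nfsstat -c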

                Any ideas where to look or what to tweak to fix this?
                Any idea why this is only occurring with CentOS 6
                on an NFS root OS?


--
- Andrew "lathama" Latham lath...@gmail.com http://lathama.com http://lathama.org -


--
Joe Landman
e: joe.land...@gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
