Re: [Beowulf] Varying performance across identical cluster nodes.

Prentice Bisbal Thu, 14 Sep 2017 06:15:53 -0700

Good question. I just checked using vmstat. When running xhpl on both systems, vmstat shows only zeros for si and so, even long after the performance degrades on the nfsroot instance. Just to be sure, I double-checked with top, which shows 0k of swap being used.
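
For anyone who wants to repeat this check without watching vmstat by hand, here is a minimal sketch in Python. It assumes a Linux node where /proc/vmstat exposes the pswpin and pswpout counters (pages swapped in and out since boot). The function names are made up for illustration and are not part of any tool mentioned in this thread.

```python
def parse_swap_counters(vmstat_text):
    """Extract the pswpin/pswpout counters from the text of /proc/vmstat.

    Counters missing from the input are reported as 0.
    """
    counters = {"pswpin": 0, "pswpout": 0}
    for line in vmstat_text.splitlines():
        fields = line.split()
        # Each line of /proc/vmstat is "<name> <value>".
        if len(fields) == 2 and fields[0] in counters:
            counters[fields[0]] = int(fields[1])
    return counters


def swap_delta(before, after):
    """Return how much each counter grew between two samples."""
    return {name: after[name] - before[name] for name in before}


def read_proc_vmstat(path="/proc/vmstat"):
    """Read the raw counter file (hypothetical helper; Linux only)."""
    with open(path) as handle:
        return handle.read()
```

Taking one sample with parse_swap_counters(read_proc_vmstat()) before the xhpl run and one after the slowdown sets in, then passing both to swap_delta, should give zeros for both counters if the node really is not swapping.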


Prentice


On 09/13/2017 02:15 PM, Scott Atchley wrote:

Are you swapping?

On Wed, Sep 13, 2017 at 2:14 PM, Andrew Latham <lath...@gmail.com> wrote:


    ack, so maybe validate you can reproduce with another nfs root.
    Maybe a lab setup where a single server is serving nfs root to the
    node. If you could reproduce in that way then it would give some
    direction. Beyond that it sounds like an interesting problem.

    On Wed, Sep 13, 2017 at 12:48 PM, Prentice Bisbal
    <pbis...@pppl.gov> wrote:

        Okay, based on the various responses I've gotten here and on
        other lists, I feel I need to clarify things:

        This problem only occurs when I'm running our NFSroot based
        version of the OS (CentOS 6). When I run the same OS installed
        on a local disk, I do not have this problem, using the same
        exact server(s).  For testing purposes, I'm using LINPACK, and
        running the same executable  with the same HPL.dat file in
        both instances.

        Because I'm testing the same hardware with different OS
        installs, this should rule out the BIOS and faulty hardware
        as the cause. This leads me to believe it's most likely a
        software configuration issue, such as a kernel tuning
        parameter.

        These are Supermicro servers, and it seems they do not provide
        CPU temps. I do see a chassis temp, but not the temps of the
        individual CPUs. While I agree that should be the first thing
        I look at, it's not an option for me. Other tools like FLIR
        and Infrared thermometers aren't really an option for me, either.

        What software configuration, either a kernel parameter,
        configuration of numad or cpuspeed, or some other setting,
        could affect this?

        Prentice

        On 09/08/2017 02:41 PM, Prentice Bisbal wrote:

            Beowulfers,

            I need your assistance debugging a problem:

            I have a dozen servers that are all identical hardware:
            SuperMicro servers with AMD Opteron 6320 processors. Ever
            since we upgraded to CentOS 6, the users have been
            complaining of wildly inconsistent performance across
            these 12 nodes. I ran LINPACK on these nodes, and was able
            to duplicate the problem, with performance varying from
            ~14 GFLOPS to 64 GFLOPS.

            I've identified that performance on the slower nodes
            starts off fine, and then slowly degrades throughout the
            LINPACK run. For example, on a node with this problem,
            during the first LINPACK test, I can see the performance drop
            from 115 GFLOPS down to 11.3 GFLOPS. That constant,
            downward trend continues throughout the remaining tests.
            At the start of subsequent tests, performance will jump up
            to about 9-10 GFLOPS, but then drop to 5-6 GFLOPS at the
            end of the test.

            Because of the nature of this problem, I suspect this
            might be a thermal issue. My guess is that the processor
            speed is being throttled to prevent overheating on the
            "bad" nodes.

            But here's the thing: this wasn't a problem until we
            upgraded to CentOS 6. Where I work, we use a read-only
            NFSroot filesystem for our cluster nodes, so all nodes are
            mounting and using the same exact read-only image of the
            operating system. This only happens with these SuperMicro
            nodes, and only with the CentOS 6 on NFSroot. RHEL5 on
            NFSroot worked fine, and when I installed CentOS 6 on a
            local disk, the nodes worked fine.

            Any ideas where to look or what to tweak to fix this? Any
            idea why this is only occurring with RHEL 6 w/ NFS root OS?


        _______________________________________________
        Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
        To change your subscription (digest mode or unsubscribe) visit
        http://www.beowulf.org/mailman/listinfo/beowulf

    -- Andrew "lathama" Latham lath...@gmail.com http://lathama.com

