Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-08 Thread Joe Landman
On 09/08/2017 02:41 PM, Prentice Bisbal wrote: But here's the thing: this wasn't a problem until we upgraded to CentOS 6. Where I work, we use a read-only NFSroot filesystem for our cluster nodes, so all nodes are mounting and using the same exact read-only image of the operating system. Thi

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-08 Thread Bill Broadley
Last time I saw this problem was because the chassis was missing the air redirection guides, and not enough air was getting to the CPUs. The OS upgrade might actually be enabling better throttling to keep the CPU cooler.

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-08 Thread Lux, Jim (337C)
Do you have a temperature probe? One of those IR thermometers? A FLIR One camera for your phone? Then you can quickly check things like heat sink temperatures and surroundings. Air temp is hard to measure quickly and accurately. Jim Lux

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-08 Thread Skylar Thompson
I would also suspect a thermal issue, though it could also be firmware. To verify a temperature problem, you might try setting up lm_sensors or scraping "ipmitool sdr" output (whichever is easier) regularly and try to make a performance-vs-temperature plot for each node. As Andrew mentioned, it cou
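The "scrape ipmitool sdr output" suggestion above could be sketched roughly as follows. This is a minimal sketch, not anything posted in the thread: it assumes the usual three-column `name | reading | status` layout that `ipmitool sdr` prints, and the sensor names in the usage example (`CPU1 Temp`, `FAN1`, ...) are hypothetical and will differ by board. The helper names `parse_sdr_temperatures` and `read_node_temperatures` are made up for illustration.

```python
import re
import subprocess

# Matches temperature rows such as "CPU1 Temp        | 45 degrees C      | ok".
# Rows without a numeric "degrees C" reading (fans, voltages, "no reading") are skipped.
_TEMP_LINE = re.compile(
    r"^\s*(?P<name>[^|]+?)\s*\|\s*(?P<value>-?\d+(?:\.\d+)?)\s+degrees C\s*\|"
)


def parse_sdr_temperatures(sdr_text):
    """Return {sensor_name: temperature_in_C} parsed from `ipmitool sdr` text."""
    temps = {}
    for line in sdr_text.splitlines():
        match = _TEMP_LINE.match(line)
        if match:
            temps[match.group("name")] = float(match.group("value"))
    return temps


def read_node_temperatures():
    """Run `ipmitool sdr` on the local node (needs ipmitool and BMC access)."""
    result = subprocess.run(
        ["ipmitool", "sdr"], capture_output=True, text=True, check=True
    )
    return parse_sdr_temperatures(result.stdout)
```

Usage on captured output (sensor names are illustrative only):

```python
sample = """\
CPU1 Temp        | 45 degrees C      | ok
CPU2 Temp        | 61 degrees C      | ok
FAN1             | 5400 RPM          | ok
System Temp      | no reading        | ns
"""
print(parse_sdr_temperatures(sample))
```

Logging these readings with a timestamp next to each benchmark result, per node, would give the raw data for the performance-vs-temperature plot.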

Re: [Beowulf] Varying performance across identical cluster nodes.

2017-09-08 Thread Andrew Latham
Shooting from hip
1. BIOS identical version and settings
2. Firmware on device (I assume nothing just thinking out loud)
3. Re-seat fans/replace (oxidized contacts - silly but why not)
4. Verify the power supplies are identical (various watts etc... maybe swap out and test)
5. Memory cooling heat-s

[Beowulf] Varying performance across identical cluster nodes.

2017-09-08 Thread Prentice Bisbal
Beowulfers, I need your assistance debugging a problem: I have a dozen servers that are all identical hardware: SuperMicro servers with AMD Opteron 6320 processors. Ever since we upgraded to CentOS 6, the users have been complaining of wildly inconsistent performance across these 12 nodes. I