On 09/08/2017 02:41 PM, Prentice Bisbal wrote:
But here's the thing: this wasn't a problem until we upgraded to
CentOS 6. Where I work, we use a read-only NFSroot filesystem for our
cluster nodes, so all nodes are mounting and using the same exact
read-only image of the operating system. Thi
The last time I saw this problem, it was because the chassis was missing the air
redirection guides, and not enough air was getting to the CPUs.
The OS upgrade might actually be enabling better throttling to keep the CPU
cooler.
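
One rough way to check the throttling theory is to compare each core's current clock against its rated maximum while a job is running. The sketch below assumes the node exposes the standard Linux cpufreq sysfs files (scaling_cur_freq and cpuinfo_max_freq); the function names and the 0.9 ratio are illustrative choices, not anything from this thread. With an on-demand governor, idle cores clock down on their own, so the result only means something under load.

```python
import glob
import os


def read_cpu_freqs(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: (current_kHz, max_kHz)} from the cpufreq sysfs files."""
    freqs = {}
    for cpu_dir in sorted(glob.glob(os.path.join(sysfs_root, "cpu[0-9]*"))):
        cur_path = os.path.join(cpu_dir, "cpufreq", "scaling_cur_freq")
        max_path = os.path.join(cpu_dir, "cpufreq", "cpuinfo_max_freq")
        try:
            with open(cur_path) as f:
                cur_khz = int(f.read().strip())
            with open(max_path) as f:
                max_khz = int(f.read().strip())
        except (OSError, ValueError):
            # No cpufreq driver loaded for this core, or unreadable value.
            continue
        freqs[os.path.basename(cpu_dir)] = (cur_khz, max_khz)
    return freqs


def throttled(freqs, ratio=0.9):
    """List cores running below ratio * max (ratio is an arbitrary cutoff)."""
    return sorted(cpu for cpu, (cur, mx) in freqs.items() if cur < ratio * mx)


if __name__ == "__main__":
    slow = throttled(read_cpu_freqs())
    print("cores below 90% of max clock:", slow or "none")
```

Running this on a fast node and a slow node during the same benchmark would show whether the slow node's cores are being held at a lower clock.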
___
Beowulf mailing list, Beowu
Do you have a temperature probe? One of those IR thermometers?
A FLIR One camera for your phone?
Then you can quickly check things like heat sink temperatures and surroundings.
Air temp is hard to measure quickly and accurately.
Jim Lux
(818)354-2075 (office)
(818)395-2714 (cell)
From: Beowul
I would also suspect a thermal issue, though it could also be firmware. To
verify a temperature problem, you might try setting up lm_sensors or
scraping "ipmitool sdr" output (whichever is easier) regularly and try to
make a performance-vs-temperature plot for each node. As Andrew mentioned,
it cou
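
A minimal sketch of the "ipmitool sdr" scraping idea, assuming the usual three-column output (sensor name | reading | status) with temperatures reported as "NN degrees C". Sensor names vary by board, so the ones in the sample are made up for illustration, and the function names here are hypothetical. Logging the result with a timestamp next to each benchmark run would give the data for a performance-vs-temperature plot.

```python
import re
import subprocess

# Matches a reading column such as "45 degrees C".
_TEMP_RE = re.compile(r"^(-?\d+(?:\.\d+)?)\s+degrees C$")


def parse_sdr_temps(sdr_text):
    """Return {sensor_name: temp_C} for every line whose reading is in degrees C."""
    temps = {}
    for line in sdr_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 3:
            continue  # not a sensor line
        match = _TEMP_RE.match(fields[1])
        if match:
            temps[fields[0]] = float(match.group(1))
    return temps


def read_sdr():
    """Run 'ipmitool sdr' locally and return its text output (usually needs root)."""
    result = subprocess.run(
        ["ipmitool", "sdr"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    for name, temp in sorted(parse_sdr_temps(read_sdr()).items()):
        print(f"{name}: {temp:.1f} C")
```

Keeping the parser separate from the ipmitool call means the same function can be pointed at output collected over ssh or saved from cron.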
Shooting from the hip:
1. BIOS identical version and settings
2. Firmware on device (I assume nothing just thinking out loud)
3. Re-seat fans/replace (oxidized contacts - silly but why not)
4. Verify the power supplies are identical (various watts etc... maybe swap
out and test)
5. Memory cooling heat-s
Beowulfers,
I need your assistance debugging a problem:
I have a dozen servers that are all identical hardware: SuperMicro
servers with AMD Opteron 6320 processors. Ever since we upgraded to
CentOS 6, the users have been complaining of wildly inconsistent
performance across these 12 nodes. I