Another thing to perhaps look at: are you seeing messages about thermal throttling events in the system logs? Could that node have a piece of debris caught in its air intake?
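For anyone wanting to check the throttling suggestion quickly: a rough sketch, assuming a Linux node with the usual sysfs layout (the paths are standard kernel locations but may be absent on some kernels/distros):

```shell
#!/bin/sh
# Per-CPU count of thermal throttle events since boot (all zeros = no throttling).
# Standard sysfs path on x86 Linux; may not exist on every kernel.
for f in /sys/devices/system/cpu/cpu*/thermal_throttle/core_throttle_count; do
    [ -r "$f" ] && echo "$f: $(cat "$f")"
done

# Kernel log messages about throttling (dmesg may require root on newer kernels).
dmesg 2>/dev/null | grep -i throttl || echo "no throttling messages in dmesg"
```

A blocked intake usually shows up here long before it shows up as a clean 30% slowdown, so zeros everywhere would point away from thermals.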
I don't think that will produce a 30% drop in performance. But I have caught
compute nodes with pieces of packaging sucked onto the front, following
careless people unpacking kit in machine rooms. (Firm rule - no packaging in
the machine room. This means you.)

On 10 August 2017 at 17:00, John Hearns <hear...@googlemail.com> wrote:

> ps. Look at watch cat /proc/interrupts also
> You might get a qualitative idea of a huge rate of interrupts.
>
> On 10 August 2017 at 16:59, John Hearns <hear...@googlemail.com> wrote:
>
>> Faraz,
>> I think you might have to buy me a virtual coffee. Or a beer!
>> Please look at the hardware health of that machine. Specifically the
>> DIMMs. I have seen this before!
>> If you have some DIMMs which are faulty and are generating ECC errors,
>> and the mcelog service is enabled, then an interrupt is generated for
>> every ECC event. So the system is spending time servicing these
>> interrupts.
>>
>> So: look in your /var/log/mcelog for hardware errors.
>> Look in your /var/log/messages for hardware errors also.
>> Look in the IPMI event logs for ECC errors: ipmitool sel elist
>>
>> I would also bring that node down and boot it with memtester.
>> If there is a DIMM which is that badly faulty then memtester will
>> discover it within minutes.
>>
>> Or it could be something else - in which case I get no coffee.
>>
>> Also Intel Cluster Checker is intended to deal with exactly these
>> situations.
>> What is your cluster manager, and is Intel Cluster Checker available to
>> you? I would seriously look at getting this installed.
>>
>> On 10 August 2017 at 16:39, Faraz Hussain <i...@feacluster.com> wrote:
>>
>>> One of our compute nodes runs ~30% slower than others. It has the exact
>>> same image, so I am baffled why it is running slow. I have tested OMP
>>> and MPI benchmarks. Everything runs slower. The CPU usage goes to 2000%,
>>> so all looks normal there.
>>>
>>> I thought it may have to do with CPU scaling, i.e. when the kernel
>>> changes the CPU speed depending on the workload. But we do not have
>>> that enabled on these machines.
>>>
>>> Here is a snippet from "cat /proc/cpuinfo". Everything is identical to
>>> our other nodes. Any suggestions on what else to check? I have tried
>>> rebooting it.
>>>
>>> processor       : 19
>>> vendor_id       : GenuineIntel
>>> cpu family      : 6
>>> model           : 62
>>> model name      : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
>>> stepping        : 4
>>> cpu MHz         : 2500.098
>>> cache size      : 25600 KB
>>> physical id     : 1
>>> siblings        : 10
>>> core id         : 12
>>> cpu cores       : 10
>>> apicid          : 56
>>> initial apicid  : 56
>>> fpu             : yes
>>> fpu_exception   : yes
>>> cpuid level     : 13
>>> wp              : yes
>>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>>> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall
>>> nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology
>>> nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2
>>> ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt
>>> tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat xsaveopt pln
>>> pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
>>> bogomips        : 5004.97
>>> clflush size    : 64
>>> cache_alignment : 64
>>> address sizes   : 46 bits physical, 48 bits virtual
>>> power management:
>>>
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>>> To change your subscription (digest mode or unsubscribe) visit
>>> http://www.beowulf.org/mailman/listinfo/beowulf
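Pulling the checks from this thread together: a rough diagnostic sketch, assuming RHEL-style log paths, ipmitool installed with BMC access, and a standard Linux /proc layout (each step degrades gracefully if something is missing):

```shell
#!/bin/sh
# 1. Hardware/ECC errors in the usual logs (RHEL-style paths, an assumption).
grep -i "hardware error" /var/log/mcelog /var/log/messages 2>/dev/null \
    || echo "no hardware errors found in logs"

# 2. ECC events in the IPMI event log (skipped if ipmitool is not installed).
command -v ipmitool >/dev/null && ipmitool sel elist 2>/dev/null | grep -i ecc

# 3. Estimate the total interrupt rate: sum every numeric counter in
#    /proc/interrupts, sample twice one second apart, and take the delta.
sum_irqs() {
    awk '{ for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) s += $i } END { print s }' \
        /proc/interrupts
}
a=$(sum_irqs); sleep 1; b=$(sum_irqs)
echo "approx interrupts/sec: $((b - a))"

# 4. CPU frequency scaling: which governor is active (if cpufreq is enabled
#    at all), and whether the current core frequencies are all at nominal.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 2>/dev/null \
    || echo "cpufreq scaling not enabled"
grep "cpu MHz" /proc/cpuinfo | sort | uniq -c
```

Comparing the interrupts/sec figure between the slow node and a healthy one is the qualitative check suggested above: a faulty DIMM spraying corrected-ECC interrupts will stand out by orders of magnitude.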