I put €10 on the nose for a faulty power supply.

On 10 August 2017 at 19:45, Gus Correa <g...@ldeo.columbia.edu> wrote:

> + Leftover processes from previous jobs hogging resources.
> That's relatively common.
> That can trigger swapping, the ultimate performance killer.
> "top" or "htop" on the node should show something.
> (Will go away with a reboot, of course.)
>
> Less likely, but possible:
>
> + Different BIOS configuration w.r.t. the other nodes.
>
> + Poorly seated memory, IB card, etc., or loose cable connections.
>
> + IPMI may need a hard reset.
> Power down, remove the power cable, wait several minutes,
> put the cable back, power on.
>
> Gus Correa
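
For reference, a minimal sketch of those first two checks (leftover
processes and swapping) on the slow node - plain shell, nothing
site-specific assumed:

    # Top CPU consumers: anything owned by a user whose job already
    # finished is a leftover process competing with your benchmark
    ps -eo user,pid,pcpu,pmem,etime,comm --sort=-pcpu | head -20

    # Swap activity: non-zero "si"/"so" columns over several samples
    # mean the node is actively swapping
    free -m
    vmstat 1 5
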
> On 08/10/2017 11:17 AM, John Hearns via Beowulf wrote:
>
>> Another thing to perhaps look at: are you seeing messages about thermal
>> throttling events in the system logs?
>> Could that node have a piece of debris caught in its air intake?
>>
>> I don't think that would produce a 30% drop in performance, but I have
>> caught compute nodes with pieces of packaging sucked onto the front,
>> following careless people unpacking kit in machine rooms.
>> (Firm rule - no packaging in the machine room. This means you.)
>>
>> On 10 August 2017 at 17:00, John Hearns <hear...@googlemail.com> wrote:
>>
>> ps. Look at "watch cat /proc/interrupts" also.
>> You might get a qualitative idea of a huge rate of interrupts.
>>
>> On 10 August 2017 at 16:59, John Hearns <hear...@googlemail.com> wrote:
>>
>> Faraz,
>> I think you might have to buy me a virtual coffee. Or a beer!
>> Please look at the hardware health of that machine, specifically
>> the DIMMs. I have seen this before!
>> If you have some DIMMs which are faulty and are generating ECC
>> errors, and the mcelog service is enabled, then an interrupt is
>> generated for every ECC event. So the system is spending time
>> servicing these interrupts.
>>
>> So: look in your /var/log/mcelog for hardware errors.
>> Look in your /var/log/messages for hardware errors also.
>> Look in the IPMI event log for ECC errors: ipmitool sel elist
>>
>> I would also bring that node down and boot it with memtester.
>> If there is a DIMM which is that badly faulty, then memtester
>> will discover it within minutes.
>>
>> Or it could be something else - in which case I get no coffee.
>>
>> Also, Intel Cluster Checker is intended to deal with exactly these
>> situations.
>> What is your cluster manager, and is Intel Cluster Checker
>> available to you? I would seriously look at getting it installed.
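
A minimal sketch of those ECC and interrupt checks collected into one
pass (all stock Linux/ipmitool commands; the grep patterns and the
memtester size are illustrative, adjust to taste):

    # Machine-check / ECC evidence in the logs
    cat /var/log/mcelog
    grep -i "hardware error" /var/log/messages

    # ECC events recorded by the BMC
    ipmitool sel elist | grep -iE 'ecc|memory'

    # Interrupt counters: the MCE, TRM (thermal) and THR (threshold)
    # rows should stay near zero; watch them and compare against a
    # healthy node
    grep -E 'MCE|TRM|THR' /proc/interrupts

    # Userspace memory test as an alternative to a memtester boot;
    # the size is an example, leave headroom for the OS
    memtester 2G 1
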
>> On 10 August 2017 at 16:39, Faraz Hussain <i...@feacluster.com> wrote:
>>
>> One of our compute nodes runs ~30% slower than the others. It
>> has the exact same image, so I am baffled why it is running
>> slow. I have tested OMP and MPI benchmarks. Everything runs
>> slower. The cpu usage goes to 2000%, so all looks normal
>> there.
>>
>> I thought it may have to do with cpu scaling, i.e. when the
>> kernel changes the cpu speed depending on the workload. But
>> we do not have that enabled on these machines.
>>
>> Here is a snippet from "cat /proc/cpuinfo". Everything is
>> identical to our other nodes. Any suggestions on what else
>> to check? I have tried rebooting it.
>>
>> processor       : 19
>> vendor_id       : GenuineIntel
>> cpu family      : 6
>> model           : 62
>> model name      : Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
>> stepping        : 4
>> cpu MHz         : 2500.098
>> cache size      : 25600 KB
>> physical id     : 1
>> siblings        : 10
>> core id         : 12
>> cpu cores       : 10
>> apicid          : 56
>> initial apicid  : 56
>> fpu             : yes
>> fpu_exception   : yes
>> cpuid level     : 13
>> wp              : yes
>> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
>> mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
>> syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts
>> rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor
>> ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2
>> x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida
>> arat xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
>> fsgsbase smep erms
>> bogomips        : 5004.97
>> clflush size    : 64
>> cache_alignment : 64
>> address sizes   : 46 bits physical, 48 bits virtual
>> power management:
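
On the cpu-scaling theory: the "cpu MHz" field above is a point-in-time
value, so on its own it cannot show whether the clock sags under load.
A minimal way to check (standard procfs/sysfs paths, nothing
cluster-specific; run while a benchmark is going):

    # Per-core clock while the job runs; on a healthy node all cores
    # should sit at or near the 2.50 GHz nominal frequency
    grep "cpu MHz" /proc/cpuinfo | sort | uniq -c

    # If a cpufreq governor is loaded at all, this shows which one;
    # the file is absent when frequency scaling is disabled
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor 2>/dev/null
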
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf