This may be a long shot, especially in a server room where everything else is working as expected. There may be nothing wrong with the machine itself; the problem could instead be the power supplied to it by the building's wiring. I have seen incorrectly supplied power cause unpredictable behaviour in a machine. But, as I said, it is a long shot.

On Fri, Aug 11, 2017 at 11:35 PM, Chris Samuel <sam...@unimelb.edu.au> wrote:

> On Friday, 11 August 2017 12:39:07 AM AEST Faraz Hussain wrote:
>
> > I thought it may have to do with cpu scaling, i.e when the kernel
> > changes the cpu speed depending on the workload. But we do not have
> > that enabled on these machines.
>
> Just to add to the excellent suggestions from others: have you compared
> BIOS/UEFI settings & versions across these nodes to ensure they're identical?
>
> Also remember that the kernel can enable C states that hurt performance even
> if they are disabled in the BIOS/UEFI. This was painfully apparent on our
> first SandyBridge cluster that almost failed the performance part of
> acceptance testing until it got found.
>
> Now we boot all nodes with this in the kernel cmdline:
>
> intel_idle.max_cstate=0 processor.max_cstate=1 intel_pstate=disable
>
> Best of luck!
> Chris
> --
> Christopher Samuel Senior Systems Administrator
> Melbourne Bioinformatics - The University of Melbourne
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
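
If it helps when comparing nodes, here is a minimal Python sketch for checking whether a node booted with the three kernel parameters Chris lists. The function names (parse_cmdline, missing_settings) and the EXPECTED table are my own and purely illustrative. It assumes a Linux node, where /proc/cmdline holds the command line the running kernel was booted with, and it splits on whitespace only, so quoted values containing spaces are not handled.

```python
# Expected settings, copied from the kernel command line quoted above.
EXPECTED = {
    "intel_idle.max_cstate": "0",
    "processor.max_cstate": "1",
    "intel_pstate": "disable",
}


def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' (e.g. 'ro') map to an empty string.
    Simplification: splits on whitespace, so quoted values with
    embedded spaces are not supported.
    """
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def missing_settings(cmdline, expected=EXPECTED):
    """Return {name: actual value or None} for each expected setting
    that is absent or set to a different value."""
    params = parse_cmdline(cmdline)
    return {
        name: params.get(name)
        for name, want in expected.items()
        if params.get(name) != want
    }


def read_running_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as f:
        return f.read()
```

Running missing_settings(read_running_cmdline()) on each node and comparing the results should show quickly whether any node was booted without these options; an empty dict means all three are set as expected.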