----- "Chris Samuel" <csam...@vpac.org> wrote:

> Very occasionally we find one of our Barcelona nodes with
> a SuperMicro H8DM8-2 motherboard powered off. IPMI reports
> it as powered down too.
>
> No kernel panic, no crash, nothing in the system logs.

I thought people might be interested in this update. We'd tried
swapping virtually everything short of the label on the front of the
case (including much higher capacity PSUs), to no avail.

We had one node, which we used as our test platform, that I could
reliably power off within 30 seconds to 1 minute by running a certain
Gaussian job. Other nodes were far more random, and we've seen this
issue on about half of the 95 nodes so far.

We decided to try 2.6.29-rc1 in case some of the extra debug info
(e.g. commit 8652cb4b0d87accbe78725fd2a13be2787059649) helped, and
were amazed to find that I could no longer kill the node: the Gaussian
job ran to completion in about 2 days.

We rebooted back to 2.6.28 (not without issues [1]) and I killed it
again in about 30 seconds. We rebooted back into 2.6.29-rc1 and it ran
happily again.

So whilst I am not saying that the problem is solved (we would need to
see a large proportion of the cluster running jobs without poweroffs
first), I can at least say that 2.6.29-rc1 does seem to be mitigating
the problem on this specific node.

We are now doing a sort of reverse bisection to try to figure out what
fixed it, which is going to take a little time! ;-)

We've got to be careful, as the git bisect tool doesn't let you have
the "good" revision after the "bad" one. It assumes you're trying to
find something that broke rather than something that fixed a problem,
so we have to remember to say "git bisect bad" when it's good and
"git bisect good" when it's bad. ;-)

cheers,
Chris

[1] - The one fly in the ointment is that when we reboot back into
2.6.28, eth1 can no longer negotiate with our gigabit switch. We think
this is due to some nforce driver changes, possibly commit
cb52deba12f27af90a46d2f8667a64888118a888.

-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
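
The inverted labelling used in the reverse bisection can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually run on the cluster: the function names (`bisect_word`, `first_fixed`), the placeholder revision names and the `powers_off` callback are all hypothetical, and the history is treated as a simple oldest-to-newest list, whereas real kernel history is a graph that git bisect walks for you.

```python
def bisect_word(node_powered_off):
    """Word to give `git bisect` when hunting for a fix (labels inverted).

    git bisect expects the "good" revision to be the older one, so the
    old, crashing kernel is reported as "good" and the newer kernel that
    survives the test job is reported as "bad".
    """
    return "good" if node_powered_off else "bad"


def first_fixed(revisions, powers_off):
    """Return the first revision on which the node no longer powers off.

    Assumptions (hypothetical, for illustration only):
      - revisions is ordered oldest to newest,
      - revisions[0] is known to power the node off,
      - revisions[-1] is known to survive the test job,
      - powers_off(rev) runs the test on rev and returns True or False.
    """
    lo, hi = 0, len(revisions) - 1  # lo: known broken, hi: known fixed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if powers_off(revisions[mid]):
            lo = mid  # still broken: the fix landed later
        else:
            hi = mid  # already fixed: the fix is here or earlier
    return revisions[hi]


if __name__ == "__main__":
    # Placeholder history; "c" is the made-up revision containing the fix.
    history = ["v2.6.28", "a", "b", "c", "d", "v2.6.29-rc1"]
    broken = {"v2.6.28", "a", "b"}
    print(first_fixed(history, lambda rev: rev in broken))
    print(bisect_word(node_powered_off=True))
```

Later Git releases (2.7 and newer) also accept alternate terms, e.g. "git bisect start --term-old=broken --term-new=fixed", which avoids having to swap "good" and "bad" by hand.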