Rahul,

I think Greg et al. are correct. Does your SC1435 have a Delta Electronics switching power supply? I bet you have a 600 watt Delta.

Intel recently had problems with outsourced 350 watt "FHJ350WPS" switching power supplies that apparently affected 5% of some server lines. The cause was a load imbalance between the 3.3 volt and 12 volt rails: the affected power supplies had a minimum loading requirement that was not being met, so the over-voltage protection circuit would kick in on the 3.3 volt rail. In those cases, however, the Intel machines would not reboot. Intel is changing the 3.3 volt minimum loading from 1.2 amps to 0.2 amps to fix the problem.
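
To put those minimum-load figures in power terms, here is a rough sketch of the arithmetic. Only the 3.3 volt rail and the 1.2 amp and 0.2 amp figures come from the note above; the function name is made up for illustration, and this is not taken from any Intel or Delta document.

```python
# Rough sketch: power a rail must deliver at its minimum required load.
# The helper name is hypothetical; only the volt/amp figures are from the note.

def min_load_watts(rail_volts: float, min_amps: float) -> float:
    """Return the power in watts drawn at the rail's minimum load."""
    return rail_volts * min_amps

old_watts = min_load_watts(3.3, 1.2)  # original minimum-load spec
new_watts = min_load_watts(3.3, 0.2)  # revised minimum-load spec

print(f"old minimum load: {old_watts:.2f} W")
print(f"new minimum load: {new_watts:.2f} W")
```

So a lightly configured box would have needed to pull roughly 4 watts from the 3.3 volt rail under the old spec, versus well under 1 watt under the new one.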

        douglas

On Apr 6, 2009, at 12:36 PM, Rahul Nabar wrote:

On Mon, Apr 6, 2009 at 3:08 AM, Greg Lindahl <lind...@pbm.com> wrote:


From your symptoms, the power supply seems to be the next thing to
suspect. From your switch of distros, it's probably not a particular
bad Linux kernel. You have a few completely new machines that don't
hang; move the known good power supplies to other nodes with suspect
mobos and cpus.

Thanks Greg. That could be it. I might give the power-supply idea a shot.

Your university's boilerplate T's & C's probably have some text that
says something like "the stuff you sell us has to work, even if the
way it fails isn't something explicitly discussed in the contract."
But, after an entire year, it will be hard to do anything. You lost
leverage when you paid Dell. It's more likely that Dell will convince
your University purchasing people that you are an idiot than the
reverse.

Well, Dell gets paid almost as soon as they deliver the machines each time.
So there's no leverage there anyways. Just curious: do any of you have
clauses wherein you pay Dell after they have demonstrated trouble free
ops for the first year or some such? We might want to add a similar
clause to our contracts in the light of this experience.


Not really. I don't think there's any global trend among vendors; you
find people with horror stories all over. Have I ever told the story
of the mobo with the exploding caps? 1/1000 chance of blowing up each
time it was power cycled. Kinda obvious in a 1000 node cluster... how
it slipped through the mobo vendor's QA?

Yeah, we got screwed by a similar capacitor issue. The Optiplexes we
were using in our legacy home-brewed cluster (before we started buying
rack servers) had a capacitor-recall. It was a widely-known issue.
Dell started providing us with replacement motherboards for those
machines that crashed because of leaky capacitors. They convinced us
they'd keep doing it on a machine-by-machine basis, and we were happy.

Unfortunately, somewhere along the way our warranty ended and the
sys-admin tracking the problem left. Now I find some newly dead
machines with the same problem, and they tell me that "the recall has
ended and you are out of warranty." No go.

Which is why I am so desperate to find what our SC1435 problem
actually is and get Dell to do the swapping while we are still safe
under our warranty. We got burnt, and this time we are trying to be
smarter!

--
Rahul
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf

