>(1) Tell your Dell salesman that you have asked for help on this problem on
>a public mailing list for High Performance Computing. Tell him/her that you
>need high level Dell support on this. There are Dell customers on this list.

Thanks John. I will do that. A question: judging from my symptoms, how likely is it that this is a software issue and not hardware? They keep harping on the fact that I am running a non-validated OS. We used to run Fedora. Now we run CentOS. Same issues. They only support RedHat. I have a hard time being 100% certain, but the more I see it, the more I am convinced it is the hardware.

>(2) Suspect the RAM. Ask some serious questions of your Dell support about
>RAM compatibility - HPC applications stress the RAM. Ask, and ask again, if
>the specific RAM chips you have are certified for that motherboard. Use
>dmidecode to read out the manufacturer codes of the RAM modules - do you
>have a mix of manufacturers?

Very good idea. Never tried that. I will check. I assumed that they were all similar systems and that I had compatible RAM, since I bought it all packaged together.

>Ask and ask again about BIOS updates being available for these machines.
>We had a case once of HP machines - even though the BIOSes were versioned
>the same on 200 machines, there were some differences - turns out you had to
>go as far as checking the build date.
>Get the very latest BIOS version you can.

I have the latest. But that's only based on the version #. I will dig deeper. Could this be a bad BIOS, though, judging from the symptoms? Some code somewhere switches the state of that LED from blue to orange; if only I knew what the trigger was supposed to be. Someone had to write that!

>
>(3) The RAM will be the problem - but if you can keep notes and there are
>specific machines which crash more than others point this out to Dell and
>maybe suspect the PSUs being weak on those machines.

Yes. The crashes seem to be very clustered. We have had 5 specific machines out of 23 crash repeatedly. We swapped the motherboard+CPUs on those and they have not crashed again so far. But the time scale is only about 2 weeks, so I am not very confident of the statistical significance of my conclusions.
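
For the dmidecode check suggested in (2), here is a minimal sketch of how the per-node output could be summarized. It assumes Python 3.7+ on the nodes, that dmidecode is run as root, and that `dmidecode -t memory` prints its usual layout of blank-line-separated records with "Key: Value" lines. The function names (parse_memory_devices, summarize, read_local_dimms) are made up for illustration.

```python
import re
import subprocess
from collections import Counter


def parse_memory_devices(text):
    """Return one dict of fields per populated DIMM in `dmidecode -t memory` output."""
    dimms = []
    # Records are assumed to be separated by blank lines.
    for block in re.split(r"\n\s*\n", text):
        lines = [ln.strip() for ln in block.splitlines()]
        if "Memory Device" not in lines:
            continue  # skip the banner and the Physical Memory Array record
        fields = {}
        for ln in lines:
            key, sep, value = ln.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields.get("Size", "").startswith("No Module"):
            continue  # empty slot
        dimms.append(fields)
    return dimms


def summarize(dimms):
    """Count populated DIMMs per (manufacturer code, part number) pair."""
    return Counter(
        (d.get("Manufacturer", "?"), d.get("Part Number", "?")) for d in dimms
    )


def read_local_dimms():
    """Run dmidecode on this node (needs root) and parse its memory records."""
    out = subprocess.run(
        ["dmidecode", "-t", "memory"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_devices(out)
```

Calling summarize(read_local_dimms()) on each node and comparing the results should show whether any node has more than one manufacturer/part-number pair, and whether the 5 crashing nodes differ from the other 18.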

--
Rahul