On Mon, Apr 6, 2009 at 1:32 PM, Frank Gruellich <frank.gruell...@navteq.com> wrote:
> IMHO SC1435 are some kind of low-cost metal from DELL. I would not use
> them if I want a reliable system. Especially in HPC where one failed
> systems ruins your whole (maybe long running) job.

Thanks for the comments, Frank. I did not realize that the SC1435 wasn't suitable for HPC. I know it is one of the lower-end systems, with neither remote management nor hot-swappable hardware (but we don't really need the frills), but I was under the impression that this model is fairly common in other HPC installations. Maybe we were wrong, in hindsight.

> The DELL support is a bit tricky. We have Silver or Gold support for
> most systems, I don't know how they work for lower levels. I can't
> complain about Gold. For Silver they always try to make us doing stuff
> like cross testing memory, CPU or other things. (The most interesting
> request is to do a BIOS update to cure a (obviously) memory problem.
> The machine went 2 years fine with the old BIOS -- memory combination
> and suddenly it complains about it?) While I really like to do such
> hardware games I just don't have the time for it. If you keep refusing
> these requests, eventually they give up and send a technican replacing
> different pieces of hardware.

I ought to check whether we are "Gold", "Silver", or neither. Yes, I am familiar with the BIOS update gig; I can almost quote their debug checklist from memory. They made me confirm and update BIOSes too. It was especially funny since it had not even been a month since we bought the machines, yet the tech insisted our BIOS was *not* up to date. We fixed it, but I still wonder why they do not just ship machines with an up-to-date BIOS!

> We use CentOS for most installation and DELL support never complained
> about it. And IMHO the OS should be able to cause an error detected by
> the management board.

Exactly my opinion. It clearly seems to be a hardware-level fault, and the OS angle seems mostly smoke-and-mirrors to me. If it were a simple software crash, I cannot explain why pressing the reboot button would fail to reboot the system.

> I have dset reports in place, before calling support, because they
> always request them. That speeds up chit-chat a bit.

Yes, dset and sosreports seem to be standard requests.

> That's another problem: IMHO your university should have a dedicated guy
> taking care about computer system, someone who has the time to deal with
> DELL support and so on. 23 machines don't give a full time job, but
> maybe someone who's taking care about some other Linux installation
> already. It's not a good idea to have just some grad-student doing that
> job part-time (no offense). I know that reallity looks bad.

Ah well, one does what one needs to! :) These are dedicated research machines for our computational chemistry group, so they will be running code that eventually (hopefully!) puts results into my PhD thesis! :) Most parts of system administration are fun, except maybe having to deal with stubborn vendors!

--
Rahul
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf