On Thu, 15 Nov 2007, David Mathog wrote:
There are some pretty good deals in the low end of the motherboard and CPU ranges right now. Not what you folks would buy, but something I'd consider to replace the old Athlon MPs in our 2U cases, one of which just blew up (or the Tyan motherboard did; it hardly matters, as I don't have spares for either part). It looks like one can buy
Ah, but I do... Yessir, genuine 2466s. Even have a few spare CPUs. And here I am, trying to get myself to throw them away and thereby clean up my office, since at this point one can buy...
a dual core Athlon64, 1 GB of memory, 1G LAN, and low end VGA on a consumer motherboard for around $150. Maybe less. With the recycled
...which is IIRC less than just one of the Athlon CPUs alone cost. Sigh. And the 2466 sucks. Well, sucked. But if you WANT them and will pay for shipping and are willing to add to your already extensive beer-debt on the rewiring of your house, I'd be happy to ship you what I've got, no guarantees. CPUs still packaged, motherboards may have been removed from packaging but I have no reason to think they don't/won't work.
case, fans, PS, and disks that would be an inexpensive way to more than resuscitate the dead node(s). The one thing that I don't see cheap anywhere is ECC RAM and motherboards that support it. Any of you running clusters without ECC? Has the lack of error correction been a problem?
A very good question. However, as always with systems, one that is very hard to answer without ECC. A single byte somewhere in your system flips a bit. A 0D turns into an 8D. If it is in the middle of a computation, unpredictable things occur. Maybe the process crashes, maybe a loop executes a few more times than it should and you get wrong answers. Maybe the answers are egregiously, obviously, horribly wrong, maybe they are subtly wrong, off by a tiny bit. Maybe the bit is in the middle of kernelspace and the system dies horribly almost immediately. Maybe it is in the middle of free memory and nothing happens, or in cached library pages. All you see is the symptom, however.

But systems DO crash. Sometimes from a bit flip, I suppose. Sometimes from a deep bug. Sometimes because they've reached a level of complexity that makes them as "alive" and self-willed as, say, a flatworm or ant or something, and with life comes perversity (do I talk to my computers and try to make them feel welcome and content? I do...). And when they've crashed, well, it's hard to say why they crashed. They're crashed, after all. Sometimes they are kind enough to print out a message as they crash saying "Oops. I've just lost my mind. Please look at the following list of nearly incomprehensible numbers and then kick me in the head." I do my duty and look at those numbers, but rarely am I able to put my finger on byte 23 and say "Aha! That 8D should be a 0D! I must have suffered a Bit Flip!". Besides, more often it just dies, silently and without reprieve or data to retrieve.

Not often, though. Given that laptops don't count -- too much going on with networking bopping up and down and 2nd Life's buggy client locking up my entire system with the whiteout screen of death (hadn't seen THAT for a while) -- I still see Linux on non-ECC systems being awesomely stable.
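For concreteness, the 0D-to-8D flip mentioned above is a change in a single bit, the top bit of the byte. A minimal sketch in Python (the helper name flip_bit is made up for illustration and comes from no library):

```python
def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted (bit 0 is the least significant)."""
    return value ^ (1 << bit)


original = 0x0D                     # 0000 1101
corrupted = flip_bit(original, 7)   # 1000 1101: only the top bit differs

print(f"{original:#04x} -> {corrupted:#04x}")

# If that byte happened to be a loop bound, the loop would now run
# 141 times instead of 13, and nothing in the program would complain.
print(original, "iterations become", corrupted)
```

Flipping the same bit again restores the original value, which is why an 8D that "should be a 0D" looks perfectly valid on its own.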
Awesomely stable on relatively small collections of boxes, however, might not translate to awesome stability on 1024-node clusters - small numbers on one might become annoying numbers on the other. ECC machines do report the errors that they correct, IIRC, at least sometimes. I don't know that I trust them, though, as predictors of non-ECC error rates. If they were, I'd expect more problems than I actually see, although I admit that to really figure out my expectation I'd have to trace down the consequences of flips in all the different pathways above in a probabilistic way. Too much work. Simpler to say that if I'm buying systems with OPM for doing professional work where bitflips might give me embarrassingly wrong answers, I cheerfully spend some of the OPM on ECC. If I'm buying systems for myself, for my desktop, for my home cluster/network, or on a small (as opposed to large) chunk of OPM, then I don't worry about it and get consumer-grade systems or motherboards to pop into the cases I already have.

rgb
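A rough sketch of why small numbers on a few boxes can become annoying numbers on 1024 nodes. It assumes nodes fail independently and that each has the same chance of a bit-flip-induced problem over some fixed period; the per-node figure below is invented purely for illustration and is not a measured error rate.

```python
def p_any_failure(p_node: float, nodes: int) -> float:
    """Chance that at least one of `nodes` independent machines has a problem."""
    return 1.0 - (1.0 - p_node) ** nodes


# Hypothetical per-node probability of a visible problem in one period.
# This is an assumed value, NOT data from any real cluster.
p_node = 0.001

for n in (1, 16, 1024):
    print(f"{n:5d} nodes: {p_any_failure(p_node, n):.3f}")
```

With these assumed numbers a single desktop almost never sees trouble, while a 1024-node job sees it more often than not.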
Thanks, David Mathog [EMAIL PROTECTED] Manager, Sequence Analysis Facility, Biology Division, Caltech _______________________________________________ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
-- Robert G. Brown Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone(cell): 1-919-280-8443 Web: http://www.phy.duke.edu/~rgb Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977