Hi all,

I agree with Vincent regarding ECC: I think it is really mandatory for a cluster which does number crunching.

However, the best cluster does not help if the deployed code does not have a test suite to verify the installation. Believe me, that is not an exception: I know a number of chemistry codes which are used in practice and have no test suite, or the test suite is broken, and the code's webpage actually says: don't bother using the test suite, it is broken and we know it. So you need both: good hardware _and_ good software with a test suite, to generate meaningful results. If one of the requirements is not met, we might as well roll dice, which is cheaper ;-)
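Even a minimal check of a few reference numbers is better than nothing. Just as a sketch of what I mean (the test names, values and tolerances below are made up, not taken from any particular code): run a small reference job after the installation and compare the numbers against known-good values, failing loudly if they drift:

// check_install.cpp - toy regression check: compare a freshly computed
// result against reference values from a known-good installation.
// All names and numbers below are placeholders, not from any real code.
#include <cmath>
#include <cstdio>
#include <vector>

struct TestCase {
    const char* name;
    double computed;   // would normally be parsed from the test job's output
    double reference;  // value from a validated installation
    double tolerance;  // acceptable absolute deviation
};

int main() {
    std::vector<TestCase> cases = {
        {"scf_total_energy", -76.026760, -76.026765, 1e-5},
        {"dipole_moment",      2.199000,   2.198900, 1e-3},
        {"mp2_correlation",   -0.204000,  -0.204100, 1e-3},
    };

    int failures = 0;
    for (const TestCase& t : cases) {
        const double diff = std::fabs(t.computed - t.reference);
        const bool ok = diff <= t.tolerance;
        std::printf("%-20s %12.6f vs %12.6f  diff %.2e  [%s]\n",
                    t.name, t.computed, t.reference, diff, ok ? "ok" : "FAIL");
        if (!ok) ++failures;
    }
    std::printf("%d of %zu checks failed\n", failures, cases.size());
    return failures == 0 ? 0 : 1;  // non-zero exit so a wrapper script notices
}

If the check fails, the installation is simply not trusted until someone has found out why.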
All the best from a wet London

Jörg


On Sunday 04 November 2012 Vincent Diepeveen wrote:
> On Nov 4, 2012, at 5:53 PM, Lux, Jim (337C) wrote:
> > On 11/3/12 6:55 PM, "Robin Whittle" <r...@firstpr.com.au> wrote:
> >> <snip>
> > [snip]
> >
> >> For serious work, the cluster and its software needs to survive power outages, failure of individual servers and memory errors, so ECC memory is a good investment . . . which typically requires more expensive motherboards and CPUs.
> >
> > Actually, I don't know that I would agree with you about ECC, etc. ECC memory is an attempt to create "perfect memory". As you scale up, the assumption of "perfect computation" becomes less realistic, so that means your application (or the infrastructure on which the application sits) has to explicitly address failures, because at sufficiently large scale they are inevitable. Once you've dealt with that, then whether ECC is needed or not (or better power supplies, or cooling fans, or lunar gravity phase compensation, or whatever) is part of your computational design and budget: it might be cheaper (using whatever metric) to overprovision and allow errors than to buy fewer, better widgets.
>
> I don't know whether 'outages' are a big issue for all clusters - here in Western Europe we hardly have power failures, so I could imagine a company with a cluster not investing in battery packs, as the company won't be able to run anyway if there isn't power.
>
> More interesting is the ECC discussion.
>
> ECC is simply a requirement IMHO, not a 'luxury thing' as some hardware engineers see it.
>
> I know some memory engineers disagree here - for example, one of them mentioned to me that "putting ECC onto a GPU is nonsense, as it is a lot of effort and GDDR5 already has a built-in CRC", something like that (if I remember the quote correctly).
>
> But they do not administer servers themselves.
>
> Also, they don't understand the accuracy, or rather the LACK of accuracy, with which calculations done on big iron get checked. If you compute on a cluster and get a result after some months, the reality is simply that 99% of researchers aren't in the Einstein league, and 90% simply would not, by any standard, notice an obvious problem generated by a bit flip here or there. They would just happily invent a new theory, as we have already seen too often in history.
>
> By simply putting in ECC you avoid this 'interpreting the results correctly' problem in some percentage of the cases.
>
> Furthermore, there are too many calculations where a single bit flip could be catastrophic, and computing for a few months on hundreds of cores without ECC is asking for trouble.
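Indeed, and it takes very little to illustrate that. Flipping a single exponent bit of an IEEE 754 double changes the value by orders of magnitude, while a flip in a low mantissa bit slips through unnoticed. A toy example (the value and the bit positions are arbitrary):

// bitflip.cpp - flip one bit in a double to see how much damage it can do.
// The value and the bit positions are arbitrary, for illustration only.
#include <cstdint>
#include <cstdio>
#include <cstring>

double flip_bit(double x, int bit) {
    std::uint64_t u;
    std::memcpy(&u, &x, sizeof u);   // reinterpret the double's bit pattern
    u ^= (std::uint64_t{1} << bit);  // flip the requested bit (0 = LSB)
    std::memcpy(&x, &u, sizeof u);
    return x;
}

int main() {
    const double energy = -76.026765;  // some computed result
    std::printf("original        : %.10g\n", energy);
    std::printf("mantissa bit 2  : %.10g\n", flip_bit(energy, 2));   // tiny change, goes unnoticed
    std::printf("exponent bit 55 : %.10g\n", flip_bit(energy, 55));  // off by orders of magnitude
    return 0;
}

Without ECC, nothing in the hardware even tells you which of the two happened.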
> As a last argument I want to note that in many sciences we simply see that the post-World War II standard of using alpha = 0.05, i.e. a 5% significance level (roughly two standard deviations), simply isn't accurate enough anymore for today's generation of scientists.
>
> They need more accuracy.
>
> So, historic debates on what is or isn't enough aside - reducing errors by means of ECC is really important.
>
> Now that said - if someone shows up with a different form of checking that is just as accurate or even better, that would be acceptable as well - yet most discussions with the hardware engineers typically go like: "why do all this effort to get rid of a few errors, when my Windows laptop just gets rebooted if it crashes".
>
> Such discussions really should be a thing of the past - society is moving on - one needs far higher accuracy and reliability now, simply because the CPUs do more calculations and the memory therefore has to serve more bytes per second.
>
> In all that, ECC is a requirement for huge clusters and, from my viewpoint, also for relatively tiny clusters.
>
> >> I understand that the most serious limitation of this approach is the bandwidth and latency (how long it takes for a message to get to the destination server) of 1Gbps Ethernet. The most obvious alternatives are using multiple 1Gbps Ethernet connections per server (but this is complex and only marginally improves bandwidth, while doing little or nothing for latency) or upgrading to Infiniband. As far as I know, Infiniband is exotic and expensive compared to the mass market motherboards etc. from which a Beowulf cluster can be made. In other words, I think Infiniband is required to make a cluster work really well, but it does not (yet) meet the original Beowulf goal of being inexpensive and commonly available.
> >
> > Perhaps a distinction should be made between "original Beowulf" and "cluster computer"? As you say, the original idea (espoused in the book, etc.) is a cluster built from cheap commodity parts. That would mean "commodity packaging", "commodity interconnects", etc., which for the most part meant tower cases and Ethernet. However, cheap custom sheet metal is now available (back when Beowulfs were first being built, rooms full of servers were still a fairly new and novel thing, and you paid a significant premium for rack-mount chassis, especially as consumer pressure forced traditional tower case prices down).
> >
> >> I think this model of HPC cluster computing remains fundamentally true, but there are two important developments in recent years which either alter the way a cluster would be built or used, or which may make the best solution to a computing problem no longer a cluster. These developments are large numbers of CPU cores per server, and the use of GPUs to do massive amounts of computing in a single inexpensive graphics card - more crunching than was possible in massive clusters a decade earlier.
> >
> > Yes. But in some ways, utilizing them has the same sort of software problem as using multiple nodes in the first place (EP aside). And the architecture of the interconnects is heterogeneous compared to the fairly uniform interconnect of a generalized cluster fabric. One can raise the same issues with cache, by the way.
> >> The ideal computing system would have a single CPU core which could run at arbitrarily high frequencies, with low latency, high bandwidth, access to an arbitrarily large amount of RAM, with matching links to hard disks or other non-volatile storage systems, and with a good Ethernet link to the rest of the world.
> >>
> >> While CPU clock frequencies and computing effort per clock cycle have been growing slowly for the last 10 years or so, there has been a continuing increase in the number of CPU cores per CPU device (typically a single chip, but sometimes multiple chips in a device which is plugged into the motherboard) and in the number of CPU devices which can be plugged into a motherboard.
> >
> > That's because CPU clock is limited by physics. "Work per clock cycle" is also limited by physics to a certain extent (because today's processors are mostly synchronous, so you have a propagation delay from one side of the processor to the other), except for things like array processors (SIMD), but I'd say that's just multiple processors that happen to be doing the same thing, rather than a single processor doing more.
> >
> > The real force driving multiple cores is the incredible expense of getting on and off chip. Moving a bit across the chip is easy compared to off chip: you have to change the voltage levels, have enough current to drive a trace, propagate down that trace, receive the signal at the other end, and shift voltages again.
> >
> >> Most mass market motherboards are for a single CPU device, but there are a few two- and four-CPU motherboards for Intel and AMD CPUs.
> >>
> >> It is possible to get 4 (mass market), 6, 8, 12 or sometimes 16 CPU cores per CPU device. I think the 4-core i7 CPUs, or their ECC-compatible Xeon equivalents, are marginally faster than those with 6 or 8 cores.
> >>
> >> In all cases, as far as I know, combining multiple CPU cores and/or multiple CPU devices results in a single computer system, with a single operating system and a single body of memory, with multiple CPU cores all running around in this shared memory.
> >
> > Yes.. That's a fairly simple model and easy to program for.
> >
> >> I have no clear idea how each CPU core knows what the other cores have written to the RAM they are using, since each core is reading and writing via its own cache of the memory contents. This raises the question of inter-CPU-core communications, within a single CPU chip, between chips in a multi-chip CPU module, and between multiple CPU modules on the one motherboard.
> >
> > Generally handled by the OS kernel. In a multitasking OS, the scheduler just assigns the next free CPU to the next task. Whether you restore the context from processor A to processor A or to processor B doesn't make much difference. Obviously, there are cache issues (since that's part of context). This kind of thing is why multiprocessor kernels are non-trivial.
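To add to that: on the multi-core machines discussed here, the caches themselves are kept consistent by the hardware's cache-coherence protocol, so a core does eventually see what the others wrote; what the programmer has to control is the ordering of accesses, which in C++ is done with locks or atomics. A minimal sketch in plain standard C++ (the shared value is arbitrary):

// coherence.cpp - two threads sharing memory: the hardware keeps the caches
// coherent, but the program still has to order its accesses, e.g. with an
// atomic flag using release/acquire semantics. Build with -pthread.
#include <atomic>
#include <cstdio>
#include <thread>

static double result = 0.0;             // ordinary shared data
static std::atomic<bool> ready{false};  // publication flag

void producer() {
    result = 42.0;                                 // write the data first ...
    ready.store(true, std::memory_order_release);  // ... then publish it
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {
        // spin until the producer has published; acquire pairs with release,
        // so once 'ready' is seen as true, 'result' is guaranteed visible too
    }
    std::printf("consumer sees result = %g\n", result);
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}

The release/acquire pair is what guarantees that once the consumer sees the flag, it also sees the data written before it; the coherence of the caches themselves is not the programmer's problem.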
> >> I understand that MPI works identically from the programmer's perspective between CPU cores on a shared memory computer and between CPU cores on separate servers. However, the performance (low latency and high bandwidth) of these communications within a single shared memory system is vastly higher than between any separate servers, which would rely on Infiniband or Ethernet.
> >
> > Yes. This is a problem with a simple interconnect model.. It doesn't necessarily reflect that the cost of the interconnect is different depending on how far and how fast you're going. That said, there is a fair amount of research into this. Hypercube processors had limited interconnects between nodes (only nearest neighbor), and there are toroidal fabrics (2D interconnects) as well.
> >
> >> So even if you have, or are going to write, MPI-based software which can run on a cluster, there may be an argument for not building a cluster as such, but for building a single-motherboard system with as many as 64 CPU cores.
> >
> > Sure.. If your problem is of a size that it can be solved by a single box, then that's usually the way to go. (It applies in areas outside of computing.. Better to have one big transmitter tube than lots of little ones.) But it doesn't scale. The instant the problem gets too big, you're stuck. The advantage of clusters is that they are scalable. Your problem gets 2x bigger; in theory, you add another N nodes and you're ready to go (Amdahl's law can bite you, though).
> >
> > There's even been a lot of discussion over the years on this list about the optimum size of cluster to build for a big task, given that computers are getting cheaper/more powerful. If you've got 2 years' worth of computing, do you buy a computer today that can finish the job in 2 years, or do you do nothing for a year and buy a computer that is twice as fast in a year?
> >
> >> I think the major new big academic cluster projects focus on getting as many CPU cores as possible into a single server, while minimising power consumption per unit of compute power, and then hooking as many as possible of these servers together with Infiniband.
> >
> > That might be an aspect of trying to make a general-purpose computing resource within a specified budget.
> >
> >> Here is a somewhat rambling discussion of my own thoughts regarding clusters and multi-core machines, for my own purposes. My interests in high performance computing involve music synthesis and physics simulation.
> >>
> >> There is an existing, single-threaded (written in C, can't be made multithreaded in any reasonable manner) music synthesis program called Csound. I want to use this now, but as a language for synthesis I think it is extremely clunky. So I plan to write my own program - one day . . . When I do, it will be written in C++ and multithreaded, so it will run nicely on multiple CPU cores in a single machine. Writing and debugging a multithreaded program is more complex than doing so for a single-threaded program, but I think it will be practical and a lot easier than writing and debugging an MPI-based program running either on multiple servers or on multiple CPU cores on a single server.
> >
> > Maybe, maybe not. How is your interthread communication architecture structured? Once you bite the bullet and go with a message passing model, it's a lot more scalable, because you're not doing stuff like "shared memory".
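That portability is exactly what makes MPI attractive here: the same few calls work unchanged whether the ranks sit on one shared-memory box or are spread over several servers; only the latency and bandwidth of the transport change. A minimal sketch using the standard MPI C interface from C++ (error handling omitted; intended for two or more ranks):

// ring.cpp - pass a token around all ranks; the identical binary runs on a
// single multi-core node or spread across servers, only the speed of the
// message passing differs.
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int next = (rank + 1) % size;
    const int prev = (rank + size - 1) % size;
    int token = 0;

    if (rank == 0) {
        token = 1;  // rank 0 starts the ring
        MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        ++token;    // each rank increments the token as it passes through
        MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
    }

    if (rank == 0) {
        std::printf("token came back as %d (world size %d)\n", token, size);
    }

    MPI_Finalize();
    return 0;
}

Run it with e.g. mpirun -np 4 ./ring on one node, or with a host file across nodes; the source does not change, only how fast the messages travel.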
> >> I want to do some simulation of electromagnetic wave propagation using an existing and widely used MPI-based (C++, open source) program called Meep. This can run as a single thread, if there is enough RAM, or the problem can be split up to run over multiple threads using MPI communication between the threads. If this is done on a single server, then the MPI communication is done really quickly, via shared memory, which is vastly faster than using Ethernet or Infiniband to other servers. However, this places a limit on the number of CPU cores and the total memory. When simulating three-dimensional models, the RAM and CPU demands can easily become extreme. Meep was written to split the problem into multiple zones, and to work efficiently with MPI.
> >
> > As you note, this is the advantage of setting up a message passing architecture from the beginning.. It works regardless of the scale/method of message passing. There *are* differences in performance.
> >
> >> Ten or 15 years ago, the only way to get more compute power was to build a cluster and therefore to write the software to use MPI. This was because CPU devices had a single core (Intel Pentium 3 and 4) and because it was rare to find motherboards which handled multiple such chips.
> >
> > Yes.
> >
> >> The next step would be to get a 4-socket motherboard from Tyan or SuperMicro for $800 or so and populate it with 8-, 12- or (if money permits) 16-core CPUs and a bunch of ECC RAM.
> >>
> >> My forthcoming music synthesis program would run fine with 8 or 16 GB of RAM. So one or two of these 16 (2 x 8) to 64 (4 x 16) core Opteron machines would do the trick nicely.

--
*************************************************************
Jörg Saßmannshausen
University College London
Department of Chemistry
Gordon Street
London
WC1H 0AJ

email: j.sassmannshau...@ucl.ac.uk
web: http://sassy.formativ.net

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf