On 04.11.2012 at 19:06, Jörg Saßmannshausen wrote:

> Hi all,
>
> I agree with Vincent regarding ECC, I think it is really mandatory for a cluster which does number crunching.
>
> However, the best cluster does not help if the deployed code does not have a test suite to verify the installation.
...and any update/patch. Once you upgrade the kernel and/or libraries, the test suite has to be run again.

-- Reuti

> Believe me, that is not an exception: I know a number of chemistry codes which are used in practice where there is no test suite, or the test suite is broken and the code's webpage actually says: don't bother using the test suite, it is broken and we know it.
>
> So you need both: good hardware _and_ good software with a test suite, to generate meaningful results. If one of the requirements is not met, we might as well roll dice, which is cheaper ;-)
>
> All the best from a wet London
>
> Jörg
>
> On Sunday 04 November 2012 Vincent Diepeveen wrote:
>> On Nov 4, 2012, at 5:53 PM, Lux, Jim (337C) wrote:
>>> On 11/3/12 6:55 PM, "Robin Whittle" <r...@firstpr.com.au> wrote:
>>>> <snip>
>>
>> [snip]
>>
>>>> For serious work, the cluster and its software need to survive power outages, failure of individual servers and memory errors, so ECC memory is a good investment . . . which typically requires more expensive motherboards and CPUs.
>>>
>>> Actually, I don't know that I would agree with you about ECC, etc. ECC memory is an attempt to create "perfect memory". As you scale up, the assumption of "perfect computation" becomes less realistic, so that means your application (or the infrastructure on which the application sits) has to explicitly address failures, because at sufficiently large scale they are inevitable. Once you've dealt with that, then whether ECC is needed or not (or better power supplies, or cooling fans, or lunar gravity phase compensation, or whatever) is part of your computational design and budget: it might be cheaper (using whatever metric) to overprovision and allow errors than to buy fewer, better widgets.
>>
>> I don't know whether power outages are a big issue for all clusters - here in Western Europe we hardly have power failures, so I can imagine a company with a cluster not investing in battery packs, as the company won't be able to run anyway if there is no power.
>>
>> More interesting is the ECC discussion.
>>
>> ECC is simply a requirement IMHO, not a 'luxury thing' as some hardware engineers see it.
>>
>> I know some memory engineers disagree here - for example, one of them mentioned to me that "putting ECC onto a GPU is nonsense, as it is a lot of effort and GDDR5 already has a built-in CRC", or something like that (if I remember the quote correctly).
>>
>> But they do not administer servers themselves.
>>
>> Also, they don't understand the accuracy - or rather the LACK of accuracy - with which results computed on big iron get checked. If you compute on a cluster and get a result after some months, the reality is simply that 99% of researchers aren't in the Einstein league, and 90% wouldn't recognise an obvious problem generated by a bit flip here or there. They would just happily invent a new theory, as we have already seen too often in history.
>>
>> By simply putting in ECC you avoid this 'interpreting the results correctly' problem in some percentage of cases.
>>
>> Furthermore, there are too many calculations where a single bit flip could be catastrophic, and computing for a few months on hundreds of cores without ECC is asking for trouble.
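
To make the bit-flip argument concrete, here is a minimal, self-contained C++ sketch (the stored value and the chosen bit positions are arbitrary, purely for illustration) of what one flipped DRAM bit does to a stored double: a flip in a low mantissa bit is practically invisible, while a flip in an exponent bit destroys the result outright - and without ECC, nothing in the machine reports that it happened.

#include <cstdint>
#include <cstdio>
#include <cstring>

// Toggle one bit of a double's 64-bit pattern, the way a cosmic-ray upset
// in non-ECC DRAM might.
double flip_bit(double x, int bit) {
    std::uint64_t u;
    std::memcpy(&u, &x, sizeof u);
    u ^= (std::uint64_t{1} << bit);
    std::memcpy(&x, &u, sizeof u);
    return x;
}

int main() {
    const double energy = -1052.73841290077;   // arbitrary "result of a long run"
    const int bits[] = {2, 30, 52, 62};        // two mantissa bits, two exponent bits
    for (int bit : bits) {
        const double corrupted = flip_bit(energy, bit);
        std::printf("bit %2d flipped: %.17g  (relative error %.1e)\n",
                    bit, corrupted, (corrupted - energy) / energy);
    }
    return 0;
}
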
>>
>> As a last argument I want to note that in many sciences the post-World-War-II standard of alpha = 0.05 - an error of at most 5%, roughly two standard deviations - simply isn't accurate enough anymore for today's generation of scientists.
>>
>> They need more accuracy.
>>
>> So, historic debates about what is or isn't enough aside - reducing errors by means of ECC is really important.
>>
>> Now that said - if someone shows up with a different form of checking that is just as accurate or even better, that would be acceptable as well - yet most discussions with the hardware engineers typically go like: "why do all this effort to get rid of a few errors, when my Windows laptop just gets rebooted if it crashes".
>>
>> Such discussions really should be discussions of the past - society is moving on - one needs far higher accuracy and reliability now, simply because the CPUs do more calculations and the memory therefore has to serve more bytes per second.
>>
>> In all that, ECC is a requirement for huge clusters and, from my viewpoint, also for relatively tiny clusters.
>>
>>>> I understand that the most serious limitation of this approach is the bandwidth and latency (how long it takes for a message to get to the destination server) of 1Gbps Ethernet. The most obvious alternatives are using multiple 1Gbps Ethernet connections per server (but this is complex and only marginally improves bandwidth, while doing little or nothing for latency) or upgrading to Infiniband. As far as I know, Infiniband is exotic and expensive compared to the mass market motherboards etc. from which a Beowulf cluster can be made. In other words, I think Infiniband is required to make a cluster work really well, but it does not (yet) meet the original Beowulf goal of being inexpensive and commonly available.
>>>
>>> Perhaps a distinction should be made between "original Beowulf" and "cluster computer"? As you say, the original idea (espoused in the book, etc.) is a cluster built from cheap commodity parts. That would mean "commodity packaging", "commodity interconnects", etc., which for the most part meant tower cases and Ethernet. However, cheap custom sheet metal is now available (back when Beowulfs were first being built, rooms full of servers were still a fairly new and novel thing, and you paid a significant premium for rack mount chassis, especially as consumer pressure forced the traditional tower case prices down).
>>>
>>>> I think this model of HPC cluster computing remains fundamentally true, but there are two important developments in recent years which either alter the way a cluster would be built or used, or which may make the best solution to a computing problem no longer a cluster. These developments are large numbers of CPU cores per server, and the use of GPUs to do massive amounts of computing in a single inexpensive graphics card - more crunching than was possible in massive clusters a decade earlier.
>>>
>>> Yes. But in some ways, utilizing them has the same sort of software problem as using multiple nodes in the first place (EP aside). And the architecture of the interconnects is heterogeneous compared to the fairly uniform interconnect of a generalized cluster fabric. One can raise the same issues with cache, by the way.
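
On that cache point, a small C++ sketch of false sharing (a 64-byte cache line and the iteration counts are assumptions for illustration): two threads incrementing counters that sit in the same cache line communicate through the coherence protocol on every write, while padding the counters onto separate lines removes that hidden traffic.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct Packed {               // both counters land in one 64-byte cache line
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};
struct Padded {               // each counter gets its own cache line
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename Counters>
double time_increments(Counters& c) {
    const long iters = 50000000;                   // arbitrary workload size
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1([&] { for (long i = 0; i < iters; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (long i = 0; i < iters; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    Packed packed;
    Padded padded;
    std::printf("same cache line     : %.2f s\n", time_increments(packed));
    std::printf("separate cache lines: %.2f s\n", time_increments(padded));
    return 0;
}

Compiled with something like g++ -O2 -pthread, the padded variant is typically several times faster on a multi-core machine: the cores behave like a small, non-uniform interconnect of their own.
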
>>>
>>>> The ideal computing system would have a single CPU core which could run at arbitrarily high frequencies, with low latency, high bandwidth access to an arbitrarily large amount of RAM, with matching links to hard disks or other non-volatile storage systems, and with a good Ethernet link to the rest of the world.
>>>>
>>>> While CPU clock frequencies and computing work per clock cycle have been growing slowly for the last 10 years or so, there has been a continuing increase in the number of CPU cores per CPU device (typically a single chip, but sometimes multiple chips in a device which is plugged into the motherboard) and in the number of CPU devices which can be plugged into a motherboard.
>>>
>>> That's because CPU clock is limited by physics. "Work per clock cycle" is also limited by physics to a certain extent (because today's processors are mostly synchronous, so you have a propagation delay time from one side of the processor to the other), except for things like array processors (SIMD) - but I'd say that's just multiple processors that happen to be doing the same thing, rather than a single processor doing more.
>>>
>>> The real force driving multiple cores is the incredible expense of getting on and off chip. Moving a bit across the chip is easy compared to off chip: you have to change the voltage levels, have enough current to drive a trace, propagate down that trace, receive the signal at the other end, and shift voltages again.
>>>
>>>> Most mass market motherboards are for a single CPU device, but there are a few two and four CPU motherboards for Intel and AMD CPUs.
>>>>
>>>> It is possible to get 4 (mass market), 6, 8, 12 or sometimes 16 CPU cores per CPU device. I think the 4-core i7 CPUs or their ECC-compatible Xeon equivalents are marginally faster than those with 6 or 8 cores.
>>>>
>>>> In all cases, as far as I know, combining multiple CPU cores and/or multiple CPU devices results in a single computer system, with a single operating system and a single body of memory, with multiple CPU cores all running around in this shared memory.
>>>
>>> Yes. That's a fairly simple model and easy to program for.
>>>
>>>> I have no clear idea how each CPU core knows what the other cores have written to the RAM they are using, since each core is reading and writing via its own cache of the memory contents. This raises the question of inter-CPU-core communications, within a single CPU chip, between chips in a multi-chip CPU module, and between multiple CPU modules on the one motherboard.
>>>
>>> Generally handled by the OS kernel. In a multitasking OS, the scheduler just assigns the next free CPU to the next task. Whether you restore the context from processor A to processor A or to processor B doesn't make much difference. Obviously, there are cache issues (since that's part of context). This kind of thing is why multiprocessor kernels are non-trivial.
>>>
>>>> I understand that MPI works identically from the programmer's perspective between CPU-cores on a shared memory computer as between CPU-cores on separate servers.
>>>> However, the performance (low latency and high bandwidth) of these communications within a single shared memory system is vastly higher than between any separate servers, which would rely on Infiniband or Ethernet.
>>>
>>> Yes. This is a problem with a simple interconnect model: it doesn't necessarily reflect that the cost of the interconnect differs depending on how far and how fast you're going. That said, there is a fair amount of research into this. Hypercube processors had limited interconnects between nodes (only nearest neighbor), and there are toroidal fabrics (2D interconnects) as well.
>>>
>>>> So even if you have, or are going to write, MPI-based software which can run on a cluster, there may be an argument for not building a cluster as such, but for building a single motherboard system with as many as 64 CPU cores.
>>>
>>> Sure. If your problem is of a size that it can be solved by a single box, then that's usually the way to go. (It applies in areas outside of computing: better to have one big transmitter tube than lots of little ones.) But it doesn't scale. The instant the problem gets too big, you're stuck. The advantage of clusters is that they are scalable: your problem gets 2x bigger and, in theory, you add another N nodes and you're ready to go (Amdahl's law can bite you, though).
>>>
>>> There's even been a lot of discussion over the years on this list about the optimum size of cluster to build for a big task, given that computers are getting cheaper/more powerful. If you've got 2 years' worth of computing, do you buy a computer today that can finish the job in 2 years, or do you do nothing for a year and then buy a computer that is twice as fast?
>>>
>>>> I think the major new big academic cluster projects focus on getting as many CPU cores as possible into a single server, while minimising power consumption per unit of compute power, and then hooking as many as possible of these servers together with Infiniband.
>>>
>>> That might be an aspect of trying to make a general purpose computing resource within a specified budget.
>>>
>>>> Here is a somewhat rambling discussion of my own thoughts regarding clusters and multi-core machines, for my own purposes. My interests in high performance computing involve music synthesis and physics simulation.
>>>>
>>>> There is an existing, single-threaded (written in C, can't be made multithreaded in any reasonable manner) music synthesis program called Csound. I want to use this now, but as a language for synthesis I think it is extremely clunky. So I plan to write my own program - one day . . . When I do, it will be written in C++ and multithreaded, so it will run nicely on multiple CPU-cores in a single machine. Writing and debugging a multithreaded program is more complex than doing so for a single-threaded program, but I think it will be practical and a lot easier than writing and debugging an MPI-based program running either on multiple servers or on multiple CPU-cores on a single server.
>>>
>>> Maybe, maybe not. How is your interthread communication architecture structured? Once you bite the bullet and go with a message passing model, it's a lot more scalable, because you're not doing stuff like "shared memory".
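
On that point, a minimal MPI ping-pong sketch in C++ (using the plain MPI C API; the 8 MiB message size is arbitrary): the same source runs unchanged whether the two ranks share one motherboard or sit on different servers, and only the measured round-trip time tells you which interconnect you got.

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;                      // 2^20 doubles = 8 MiB per message
    std::vector<double> buf(n, 1.0);

    if (size >= 2 && rank < 2) {
        double t0 = MPI_Wtime();
        if (rank == 0) {                        // rank 0 sends and waits for the echo
            MPI_Send(buf.data(), n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf.data(), n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {                                // rank 1 echoes the buffer back
            MPI_Recv(buf.data(), n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf.data(), n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
        if (rank == 0)
            std::printf("round trip of %zu bytes: %.3f ms\n",
                        n * sizeof(double), (MPI_Wtime() - t0) * 1e3);
    }
    MPI_Finalize();
    return 0;
}

Built with mpicxx and launched with e.g. mpirun -np 2, once with both ranks on one node and once spread across two nodes, it makes the shared-memory versus Ethernet/Infiniband gap directly measurable.
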
>>>
>>>> I want to do some simulation of electromagnetic wave propagation using an existing and widely used MPI-based (C++, open source) program called Meep. This can run as a single process, if there is enough RAM, or the problem can be split up to run over multiple processes using MPI communication between them. If this is done on a single server, then the MPI communication is done really quickly, via shared memory, which is vastly faster than using Ethernet or Infiniband to other servers. However, this places a limit on the number of CPU-cores and the total memory. When simulating three dimensional models, the RAM and CPU demands can easily become extreme. Meep was written to split the problem into multiple zones and to work efficiently with MPI.
>>>
>>> As you note, this is the advantage of setting up a message passing architecture from the beginning: it works regardless of the scale/method of message passing. There *are* differences in performance.
>>>
>>>> Ten or 15 years ago, the only way to get more compute power was to build a cluster and therefore to write the software to use MPI. This was because CPU devices had a single core (Intel Pentium 3 and 4) and because it was rare to find motherboards which handled multiple such chips.
>>>
>>> Yes.
>>>
>>>> The next step would be to get a 4-socket motherboard from Tyan or SuperMicro for $800 or so and populate it with 8, 12 or (if money permits) 16-core CPUs and a bunch of ECC RAM.
>>>>
>>>> My forthcoming music synthesis program would run fine with 8 or 16GB of RAM. So one or two of these 16 (2 x 8) to 64 (4 x 16) core Opteron machines would do the trick nicely.
>
> --
> *************************************************************
> Jörg Saßmannshausen
> University College London
> Department of Chemistry
> Gordon Street
> London
> WC1H 0AJ
>
> email: j.sassmannshau...@ucl.ac.uk
> web: http://sassy.formativ.net
>
> Please avoid sending me Word or PowerPoint attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.html

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf