Especially informative! Your specific examples shed light on just the sort of things I was trying to better understand. A thousand thanks! :)
On Sun, Nov 4, 2012 at 9:55 AM, Robin Whittle <r...@firstpr.com.au> wrote:

> On 2012-10-31 CJ O'Reilly asked some pertinent questions about HPC, Cluster and Beowulf computing from the perspective of a newbie:

> http://www.beowulf.org/pipermail/beowulf/2012-October/030359.html

> The replies began in the "Digital Image Processing via HPC/Cluster/Beowulf - Basics" thread on the November pages of the Beowulf Mailing List archives. (The archives strip the "Re: " from the replies' subject lines.)

> http://www.beowulf.org/pipermail/beowulf/2012-November/thread.html

> Mark Hahn wrote a response which I and CJ O'Reilly found most informative. For the benefit of other lost souls wandering into this fascinating field, I nominate Mark's reply as a:

> Beowulf/Cluster/HPC FAQ for newbies
> Mark Hahn 2012-11-03
> http://www.beowulf.org/pipermail/beowulf/2012-November/030363.html

> Quite likely the other replies and those yet to come will be well worth reading too, so please check the November thread index, and potentially December if the discussions continue there. Mark wrote a second response exploring options for some hypothetical image processing problems.

> Googling for an HPC or Cluster FAQ leads to FAQs for people who are users of specific clusters. I couldn't easily find a FAQ suitable for people with general computer experience wondering what they can and can't do, or at least what would be wise to do or not do, with clusters, High Performance Computing etc.

> The Wikipedia articles:

> http://en.wikipedia.org/wiki/High-performance_computing
> http://en.wikipedia.org/wiki/High-throughput_computing

> are of general interest.

> Here are some thoughts which might complement what Mark wrote. I am a complete newbie to this field and I hope more experienced people will correct or expand on the following.

> HPC has quite a long history, and my understanding of the Beowulf concept is to make clusters using easily available hardware and software, such as operating systems, servers and interconnects. The interconnects, I think, are invariably Ethernet, since anything else is, or at least was until recently, exotic and therefore expensive.

> Traditionally - going back a decade or more - I think the assumption has been that a single server has one or maybe two CPU cores, and that the only way to get a single system to do more work is to interconnect a large number of these servers with a good file system and fast inter-server communications, so that suitably written software (such as something written to use MPI, so that the instances of the software on multiple servers and/or CPU cores can work on the one problem efficiently) can do the job well. All these things - power supplies, motherboards, CPUs, RAM, Ethernet (one or two 1Gbps ports on the motherboard) and Ethernet switches - are inexpensive and easy to throw together into a smallish cluster.

> However, creating software to use it efficiently can be very tricky - unless of course the need can be satisfied by MPI-based software which someone else has already written.

> For serious work, the cluster and its software needs to survive power outages, failure of individual servers and memory errors, so ECC memory is a good investment... which typically requires more expensive motherboards and CPUs.
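> To make the MPI style of programming mentioned above a little more concrete, here is a rough, untested sketch - not from any real project, and assuming an MPI implementation such as Open MPI or MPICH plus a C++ compiler - of how a trivial computation can be spread across many CPU cores. The same source code runs whether those cores are on one server or on many:

>     // mpi_sum_sketch.cpp - compile with: mpicxx mpi_sum_sketch.cpp -o mpi_sum_sketch
>     // run with e.g.:       mpirun -np 8 ./mpi_sum_sketch
>     #include <mpi.h>
>     #include <cstdio>
>
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
>
>         int rank = 0, nprocs = 1;
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);    // which instance am I?
>         MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  // how many instances in total?
>
>         // Each instance ("rank") works on its own slice of the problem.
>         // Here the "problem" is just summing the integers 0..999999.
>         const long N = 1000000;
>         double local = 0.0;
>         for (long i = rank; i < N; i += nprocs)
>             local += static_cast<double>(i);
>
>         // Combine the partial results onto rank 0.
>         double total = 0.0;
>         MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
>
>         if (rank == 0)
>             std::printf("sum computed by %d ranks: %.0f\n", nprocs, total);
>
>         MPI_Finalize();
>         return 0;
>     }

> Whether the eight ranks are eight cores on one motherboard or one core on each of eight servers is decided when the job is launched (by mpirun and its host list), not in the source code.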
> I understand that the most serious limitation of this approach is the bandwidth and latency (how long it takes for a message to get to the destination server) of 1Gbps Ethernet. The most obvious alternatives are using multiple 1Gbps Ethernet connections per server (but this is complex and only marginally improves bandwidth, while doing little or nothing for latency) or upgrading to Infiniband. As far as I know, Infiniband is exotic and expensive compared to the mass market motherboards etc. from which a Beowulf cluster can be made. In other words, I think Infiniband is required to make a cluster work really well, but it does not (yet) meet the original Beowulf goal of being inexpensive and commonly available.

> I think this model of HPC cluster computing remains fundamentally true, but there are two important developments in recent years which either alter the way a cluster would be built or used, or which may make the best solution to a computing problem no longer a cluster. These developments are large numbers of CPU cores per server, and the use of GPUs to do massive amounts of computing in a single inexpensive graphics card - more crunching than was possible in massive clusters a decade earlier.

> The ideal computing system would have a single CPU core which could run at arbitrarily high frequencies, with low-latency, high-bandwidth access to an arbitrarily large amount of RAM, with matching links to hard disks or other non-volatile storage systems, and with a good Ethernet link to the rest of the world.

> While CPU clock frequencies and computing work per clock cycle have been growing slowly for the last 10 years or so, there has been a continuing increase in the number of CPU cores per CPU device (typically a single chip, but sometimes multiple chips in a device which is plugged into the motherboard) and in the number of CPU devices which can be plugged into a motherboard.

> Most mass market motherboards are for a single CPU device, but there are a few two- and four-CPU motherboards for Intel and AMD CPUs.

> It is possible to get 4 (mass market), 6, 8, 12 or sometimes 16 CPU cores per CPU device. I think the 4-core i7 CPUs, or their ECC-compatible Xeon equivalents, are marginally faster than those with 6 or 8 cores.

> In all cases, as far as I know, combining multiple CPU cores and/or multiple CPU devices results in a single computer system, with a single operating system and a single body of memory, with multiple CPU cores all running around in this shared memory. I have no clear idea how each CPU core knows what the other cores have written to the RAM they are using, since each core is reading and writing via its own cache of the memory contents. This raises the question of inter-CPU-core communications: within a single CPU chip, between chips in a multi-chip CPU module, and between multiple CPU modules on the one motherboard.
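> From the programmer's point of view, the attraction of all those cores sharing one memory is that they can simply read and write the same data structures. My understanding is that the hardware's cache-coherency protocol (over Hypertransport links between CPU modules, for instance) keeps each core's cache consistent with the others; the programmer's job is to use atomics or mutexes so that concurrent updates are well defined. A rough, untested C++11 sketch of the idea - illustrative only, not from any real program:

>     // shared_memory_threads_sketch.cpp - compile with: g++ -std=c++11 -pthread ...
>     #include <atomic>
>     #include <cstdio>
>     #include <thread>
>     #include <vector>
>
>     int main()
>     {
>         unsigned ncores = std::thread::hardware_concurrency();  // e.g. 8 on an 8-core Opteron
>         if (ncores == 0) ncores = 4;
>
>         // One shared accumulator living in ordinary RAM.  All threads see the
>         // same variable; std::atomic makes their concurrent updates well defined.
>         std::atomic<long> total(0);
>         const long N = 10000000;
>
>         std::vector<std::thread> workers;
>         for (unsigned t = 0; t < ncores; ++t) {
>             workers.emplace_back([&total, t, ncores, N]() {
>                 long local = 0;
>                 for (long i = t; i < N; i += ncores)   // each thread takes its own slice
>                     local += i;
>                 total += local;                        // one coherent update to shared memory
>             });
>         }
>         for (auto &w : workers)
>             w.join();
>
>         std::printf("%u threads, total = %ld\n", ncores, total.load());
>         return 0;
>     }

> This shared-memory, multithreaded style is what I have in mind for my own programs, as opposed to the message-passing style of the MPI sketch earlier.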
> It is my impression that the AMD socket G34 Opterons - the Magny-Cours Opteron 6100 and Interlagos 6200 devices:

> http://en.wikipedia.org/wiki/Opteron#Socket_G34

> solve these problems better than current Intel chips. For instance, it is possible to buy 2- and 4-socket motherboards (though they are not mass-market, and may require fancy power supplies) and then plug 8, 12 or 16 core CPU devices into them, with a bunch of DDR3 memory (ECC or not), and so make yourself a single shared-memory computer system with compute power which would only have been achievable with a small cluster 5 years ago.

> I understand the G34 Opterons have a separate Hypertransport link between any one CPU module and each of the other three CPU modules on a 4 CPU-module motherboard.

> I understand that MPI works identically from the programmer's perspective whether it is between CPU cores on a shared-memory computer or between CPU cores on separate servers. However, the performance (low latency and high bandwidth) of these communications within a single shared-memory system is vastly higher than between separate servers, which would rely on Infiniband or Ethernet.

> So even if you have, or are going to write, MPI-based software which can run on a cluster, there may be an argument for not building a cluster as such, but for building a single-motherboard system with as many as 64 CPU cores.

> I think the major new big academic cluster projects focus on getting as many CPU cores as possible into a single server, while minimising power consumption per unit of compute power, and then hooking as many as possible of these servers together with Infiniband.

> Here is a somewhat rambling discussion of my own thoughts regarding clusters and multi-core machines, for my own purposes. My interests in high performance computing involve music synthesis and physics simulation.

> There is an existing, single-threaded (written in C, and it can't be made multithreaded in any reasonable manner) music synthesis program called Csound. I want to use this now, but as a language for synthesis I think it is extremely clunky, so I plan to write my own program - one day... When I do, it will be written in C++ and multithreaded, so it will run nicely on multiple CPU cores in a single machine. Writing and debugging a multithreaded program is more complex than doing so for a single-threaded program, but I think it will be practical, and a lot easier than writing and debugging an MPI-based program running either on multiple servers or on multiple CPU cores on a single server.

> I want to do some simulation of electromagnetic wave propagation using an existing and widely used MPI-based (C++, open source) program called Meep. This can run as a single process, if there is enough RAM, or the problem can be split up to run over multiple processes using MPI communication between them. If this is done on a single server, then the MPI communication is done really quickly, via shared memory, which is vastly faster than using Ethernet or Infiniband to other servers. However, this places a limit on the number of CPU cores and the total memory. When simulating three-dimensional models, the RAM and CPU demands can easily become extreme. Meep was written to split the problem into multiple zones, and to work efficiently with MPI.
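> I have not looked at how Meep actually implements its zones, but the general shape of an MPI domain decomposition is roughly as follows - a toy, untested sketch in which each rank owns a slice of a 1-D grid and swaps boundary ("halo") cells with its neighbours every time step. The update rule here is a meaningless stand-in, not Meep's electromagnetic equations:

>     // halo_exchange_sketch.cpp - compile: mpicxx ...   run: mpirun -np 4 ./halo_exchange_sketch
>     #include <mpi.h>
>     #include <cstdio>
>     #include <vector>
>
>     int main(int argc, char **argv)
>     {
>         MPI_Init(&argc, &argv);
>         int rank, nprocs;
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>
>         // Each rank owns `local` interior cells plus one ghost cell at each end.
>         const int local = 1000;
>         std::vector<double> field(local + 2, 0.0);
>         if (rank == 0) field[1] = 1.0;                 // an initial disturbance
>
>         const int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
>         const int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;
>
>         for (int step = 0; step < 100; ++step) {
>             // Halo exchange: boundary cells travel to the neighbouring ranks.
>             MPI_Sendrecv(&field[1],         1, MPI_DOUBLE, left,  0,
>                          &field[local + 1], 1, MPI_DOUBLE, right, 0,
>                          MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>             MPI_Sendrecv(&field[local],     1, MPI_DOUBLE, right, 1,
>                          &field[0],         1, MPI_DOUBLE, left,  1,
>                          MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>
>             // Update the interior cells from their neighbours (stand-in physics).
>             std::vector<double> next(field);
>             for (int i = 1; i <= local; ++i)
>                 next[i] = 0.5 * field[i] + 0.25 * (field[i - 1] + field[i + 1]);
>             field.swap(next);
>         }
>
>         if (rank == 0)
>             std::printf("done: %d ranks, %d cells each\n", nprocs, local);
>         MPI_Finalize();
>         return 0;
>     }

> On a single shared-memory server those MPI_Sendrecv calls turn into fast memory copies; across servers they turn into Ethernet or Infiniband traffic. The program does not change, only its speed.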
> For Csound, my goal is to get a single piece of music synthesised in as few hours as possible, so as to speed up the iterative cycle: alter the composition, render it, listen to it and alter the composition again. Probably the best solution to this is to buy a bleeding-edge clock-speed (3.4GHz?) 4-core Intel i7 CPU and motherboard, which are commodity consumer items for the gaming market. Since Csound is probably limited primarily by integer and floating point processing, rather than by access to large amounts of memory or by reading and writing files, I could probably render three or four projects in parallel on a 4-core i7, with each rendering nearly as fast as if just one was being rendered. However, running four projects in parallel is of only marginal benefit to me.

> If I write my own music synthesis program, it will be in C++ and will be designed for multithreading via multiple CPU cores in a shared-memory (single motherboard) environment. It would be vastly more difficult to write such a program using MPI communications. Music projects can easily consume large amounts of CPU and memory power, but I would rather concentrate on running a hot-shot single-motherboard multi-CPU shared-memory system and writing multi-threaded C++ than try to write it for MPI. I think any MPI-based software would be more complex and generally slower, even on a single server (MPI communications via shared memory rather than Ethernet/Infiniband), than one which used multiple threads each communicating via shared memory.

> Ten or 15 years ago, the only way to get more compute power was to build a cluster and therefore to write the software to use MPI. This was because CPU devices had a single core (Intel Pentium 3 and 4) and because it was rare to find motherboards which handled multiple such chips.

> Now, with 4 CPU-module motherboards, it is totally different. A starting point would be to get a 2-socket motherboard and plug 8-core Opterons into it. This can be done for less than $1000, not counting RAM and power supply. For instance, the Asus KGPE-D16 can be bought for $450, and 8-core 2.3GHz G34 Opterons can be found on eBay for $200 each.

> I suspect that the bleeding-edge mass-market 4-core Intel i7 CPUs are probably faster per core than these Opterons. I haven't researched this, but they have a much faster clock rate at 3.6GHz and the CPU design is the latest product of Intel's formidable R&D. On the other hand, the Opterons have Hypertransport and, I think, generally strong floating point performance. (I haven't found benchmarks which reasonably compare these.)

> I guess that many more 8 and 12 core Opterons may come on the market as people upgrade their existing systems to use the 16-core versions.

> The next step would be to get a 4-socket motherboard from Tyan or SuperMicro for $800 or so and populate it with 8, 12 or (if money permits) 16 core CPUs and a bunch of ECC RAM.

> My forthcoming music synthesis program would run fine with 8 or 16GB of RAM, so one or two of these 16-core (2 x 8) to 64-core (4 x 16) Opteron machines would do the trick nicely.

> This involves no cluster, HPC or fancy interconnect techniques.

> Meep can very easily be limited by available RAM. A cluster solves this, in principle, since it is easy (in principle) to get an arbitrary number of servers and run a single Meep project on all of them. However, this would be slow compared to running the whole thing on multiple CPU cores in a single server.

> I think the 4-socket G34 Opteron motherboards are probably the best way to get a large amount of RAM into a single server. Each CPU-module socket has its own set of DDR3 RAM, and there are four of these.
> The Tyan S8812:

> http://www.tyan.com/product_SKU_spec.aspx?ProductType=MB&pid=670&SKU=600000180

> has 8 memory slots per CPU-module socket. Populated with 32 x 8GB ECC modules at about $80 each, this would be 256GB of memory for $2,560.

> As far as I know, if the problem requires more memory than this, then I would need to use multiple servers in a cluster, with MPI communications via Ethernet or Infiniband.

> However, if the problem can be handled by 32 to 64 CPU cores with 256GB of RAM, then doing it in a single server as described would be much faster and generally less expensive than spreading the problem over multiple servers.

> The above is a rough explanation of why the increasing number of CPU cores per motherboard, together with the increased number of DIMM slots per motherboard and the increased affordable memory per DIMM slot, means that many projects which in the past could only be handled with a cluster can now be handled faster and less expensively with a single server.

> This "single powerful multi-core server" approach is particularly interesting to me for writing my own programs for music synthesis or physics simulation. The simplest approach would be to write the software in single-threaded C++. However, that won't make use of the CPU power, so I need to use the inbuilt multithreading capabilities of C++. This requires a more complex program design, but I figure I can cope with this.

> In principle I could write a single-threaded physics simulation program and access massive memory via a 4 CPU-module Opteron motherboard, but any such program would be performance-limited by the speed of one CPU core, so it would make sense to split it up over multiple CPU cores with multithreading.
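> The multithreaded structure I have in mind is nothing exotic - roughly, give each thread a contiguous chunk of the big simulation array to update on each time step, then wait for them all before starting the next step. Another rough, untested C++11 sketch (illustrative only, with stand-in physics):

>     // split_timestep_sketch.cpp - compile with: g++ -std=c++11 -O2 -pthread ...
>     #include <cstddef>
>     #include <cstdio>
>     #include <functional>
>     #include <thread>
>     #include <vector>
>
>     // Update cells [begin, end) of `next` from `cur` - a stand-in for the real physics.
>     static void update_chunk(const std::vector<double> &cur, std::vector<double> &next,
>                              std::size_t begin, std::size_t end)
>     {
>         for (std::size_t i = begin; i < end; ++i) {
>             const double left  = (i == 0)              ? 0.0 : cur[i - 1];
>             const double right = (i + 1 == cur.size()) ? 0.0 : cur[i + 1];
>             next[i] = 0.5 * cur[i] + 0.25 * (left + right);
>         }
>     }
>
>     int main()
>     {
>         const std::size_t ncells = 10 * 1000 * 1000;    // one big grid, living in shared RAM
>         unsigned nthreads = std::thread::hardware_concurrency();
>         if (nthreads == 0) nthreads = 4;
>
>         std::vector<double> cur(ncells, 0.0), next(ncells, 0.0);
>         cur[ncells / 2] = 1.0;                          // initial disturbance
>
>         for (int step = 0; step < 10; ++step) {
>             std::vector<std::thread> pool;
>             const std::size_t chunk = ncells / nthreads;
>             for (unsigned t = 0; t < nthreads; ++t) {
>                 const std::size_t begin = t * chunk;
>                 const std::size_t end   = (t + 1 == nthreads) ? ncells : begin + chunk;
>                 pool.emplace_back(update_chunk, std::cref(cur), std::ref(next), begin, end);
>             }
>             for (auto &th : pool)
>                 th.join();                              // barrier: wait for the whole time step
>             cur.swap(next);
>         }
>         std::printf("%u threads updated %zu cells per time step\n", nthreads, ncells);
>         return 0;
>     }

> All the threads share the two big arrays directly, so there is no explicit message passing; the "communication" is just ordinary reads of neighbouring cells, which is why I expect this to be simpler and faster than the MPI equivalent on a single server.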
> For my own projects, I think this is the way forward. As far as I can see, I won't need more memory than 256GB or more CPU power than I can get on a single motherboard. If I did, then I would need to write or use MPI-based software and build a cluster - or get some time on an existing cluster.

> There is another approach as well: writing software to run on GPUs. Modern mass-market graphics cards have dozens or hundreds of processing cores in what I think is a shared, very fast, memory environment. These cores are particularly good at floating point and can be programmed to do all sorts of things, but only using a specialised subset of C / C++. I want to write in full C++. However, for many people, these GPU systems are by far the cheapest form of compute power available. This raises questions of programming them, of running several such GPU boards in a single server, and of running clusters of such servers.

> The recent thread on the Titan supercomputer exemplifies this approach: get as many CPU cores and GPU cores as possible onto a single motherboard, in a reasonably power-efficient manner, and then wire as many of these as possible together with Infiniband to form a cluster.

> Mass-market motherboards and graphics cards with Ethernet is arguably "Beowulf". If and when Infiniband turns up on mass-market motherboards without a huge price premium, that will be "Beowulf" too.

> I probably won't want or need a cluster, but I lurk here because I find it interesting.

> Multi-core conventional CPUs (64-bit Intel x86 compatible, running GNU/Linux) and multiple such CPU modules on a motherboard are chewing away at the bottom end of the cluster field. Likewise GPUs, if the problem can be handled by these extraordinary devices.

> With clusters, since the inter-server communication system (Infiniband or Ethernet), with its accompanying software requirements (MPI, and splitting the problem over multiple servers which do not share a single memory system), is the most serious bottleneck, the best approach seems to be to make each server as powerful as possible and then use as few of them as necessary.

> - Robin http://www.firstpr.com.au