On 2012-10-31 CJ O'Reilly asked some pertinent questions about HPC, Cluster, Beowulf computing from the perspective of a newbie:
http://www.beowulf.org/pipermail/beowulf/2012-October/030359.html

The replies began in the "Digital Image Processing via HPC/Cluster/Beowulf - Basics" thread on the November pages of the Beowulf Mailing List archives. (The archives strip off the "Re: " from the replies' subject line.)

http://www.beowulf.org/pipermail/beowulf/2012-November/thread.html

Mark Hahn wrote a response which I and CJ O'Reilly found most informative. For the benefit of other lost souls wandering into this fascinating field, I nominate Mark's reply as a:

  Beowulf/Cluster/HPC FAQ for newbies
  Mark Hahn 2012-11-03
  http://www.beowulf.org/pipermail/beowulf/2012-November/030363.html

Quite likely the other replies and those yet to come will be well worth reading too, so please check the November thread index, and potentially December if the discussions continue there. Mark wrote a second response exploring options for some hypothetical image processing problems.

Googling for an HPC or cluster FAQ led to FAQs for people who are users of specific clusters. I couldn't easily find a FAQ suitable for people with general computer experience wondering what they can and can't do, or at least what would be wise to do or not do, with clusters, High Performance Computing etc. The Wikipedia articles:

  http://en.wikipedia.org/wiki/High-performance_computing
  http://en.wikipedia.org/wiki/High-throughput_computing

are of general interest.

Here are some thoughts which might complement what Mark wrote. I am a complete newbie to this field and I hope more experienced people will correct or expand on the following.

HPC has quite a long history, and my understanding of the Beowulf concept is to make clusters using easily available hardware and software, such as operating systems, servers and interconnects. The interconnects, I think, are invariably Ethernet, since anything else is, or at least was until recently, exotic and therefore expensive.

Traditionally - going back a decade or more - I think the assumptions have been that a single server has one or maybe two CPU cores, and that the only way to get a single system to do more work is to interconnect a large number of these servers with a good file system and fast inter-server communications, so that suitably written software (such as something written to use MPI, so that the instances of the software on multiple servers and/or CPU cores can work on the one problem efficiently) can do the job well.

All these things - power supplies, motherboards, CPUs, RAM, Ethernet (one or two 1Gbps Ethernet ports on the motherboard) and Ethernet switches - are inexpensive and easy to throw together into a smallish cluster. However, creating software to use it efficiently can be very tricky - unless of course the need can be satisfied by MPI-based software which someone else has already written.

For serious work, the cluster and its software needs to survive power outages, failure of individual servers and memory errors, so ECC memory is a good investment . . . which typically requires more expensive motherboards and CPUs.

I understand that the most serious limitation of this approach is the bandwidth and latency (how long it takes for a message to get to the destination server) of 1Gbps Ethernet. The most obvious alternatives are using multiple 1Gbps Ethernet connections per server (but this is complex and only marginally improves bandwidth, while doing little or nothing for latency) or upgrading to Infiniband. As far as I know, Infiniband is exotic and expensive compared to the mass-market motherboards etc. from which a Beowulf cluster can be made. In other words, I think Infiniband is required to make a cluster work really well, but it does not (yet) meet the original Beowulf goal of being inexpensive and commonly available.
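To make the earlier mention of MPI-based software a bit more concrete, here is a toy sketch (my own, not from the thread) of the usual pattern: every server, and every CPU-core, runs its own copy of the same program, each copy works out which share of the job is its own, and the partial results are combined with a message at the end. The 1024 "work items" and the do_work() function are made up for the example; a real image processing job would put real code there.

// Toy MPI sketch: divide 1024 hypothetical work items (e.g. image tiles)
// across however many processes the job was launched with.
// Build with something like:  mpicxx -O2 sketch.cpp -o sketch
#include <mpi.h>
#include <cstdio>

static double do_work(int item)             // stand-in for the real per-item job
{
    return item * 0.5;                      // pretend result
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // which process am I?
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs); // how many processes in total?

    const int nitems = 1024;                // hypothetical total work items
    double local_sum = 0.0;

    // Each rank takes every nprocs-th item: rank, rank + nprocs, rank + 2*nprocs, ...
    for (int i = rank; i < nitems; i += nprocs)
        local_sum += do_work(i);

    // Combine the per-rank results on rank 0.
    double total = 0.0;
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("total from %d ranks: %f\n", nprocs, total);

    MPI_Finalize();
    return 0;
}

The same binary runs unchanged on one multi-core server or on a pile of Ethernet-connected boxes; only the launch command (for example, mpirun -np 8 ./sketch) and the speed of the messages change.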
I think this model of HPC cluster computing remains fundamentally true, but there are two important developments in recent years which either alter the way a cluster would be built or used, or which may make the best solution to a computing problem no longer a cluster. These developments are large numbers of CPU cores per server, and the use of GPUs to do massive amounts of computing in a single inexpensive graphics card - more crunching than was possible in massive clusters a decade earlier.

The ideal computing system would have a single CPU core which could run at arbitrarily high frequencies, with low-latency, high-bandwidth access to an arbitrarily large amount of RAM, with matching links to hard disks or other non-volatile storage systems, and with a good Ethernet link to the rest of the world.

While CPU clock frequencies and computing effort per clock cycle have been growing slowly for the last 10 years or so, there has been a continuing increase in the number of CPU cores per CPU device (typically a single chip, but sometimes multiple chips in a device which is plugged into the motherboard) and in the number of CPU devices which can be plugged into a motherboard. Most mass-market motherboards are for a single CPU device, but there are a few two- and four-CPU motherboards for Intel and AMD CPUs. It is possible to get 4 (mass market), 6, 8, 12 or sometimes 16 CPU cores per CPU device. I think the 4-core i7 CPUs, or their ECC-compatible Xeon equivalents, are marginally faster than those with 6 or 8 cores.

In all cases, as far as I know, combining multiple CPU cores and/or multiple CPU devices results in a single computer system, with a single operating system and a single body of memory, with multiple CPU cores all running around in this shared memory. I have no clear idea how each CPU core knows what the other cores have written to the RAM they are using, since each core is reading and writing via its own cache of the memory contents. This raises the question of inter-CPU-core communications: within a single CPU chip, between chips in a multi-chip CPU module, and between multiple CPU modules on the one motherboard.

It is my impression that the AMD socket G34 Opterons:

  http://en.wikipedia.org/wiki/Opteron#Socket_G34

the Magny-Cours Opteron 6100 and Interlagos 6200 devices, solve these problems better than current Intel chips. For instance, it is possible to buy 2- and 4-socket motherboards (though they are not mass-market, and may require fancy power supplies) and then plug 8, 12 or 16 core CPU devices into them, with a bunch of DDR3 memory (ECC or not), and so make yourself a single shared-memory computer system with compute power which would have only been achievable with a small cluster 5 years ago. I understand that on a 4 CPU-module motherboard, the G34 Opterons have a separate HyperTransport link between any one CPU module and each of the other three.

I understand that MPI works identically, from the programmer's perspective, between CPU-cores on a shared-memory computer and between CPU-cores on separate servers. However, the performance (low latency and high bandwidth) of these communications within a single shared-memory system is vastly higher than between separate servers, which would rely on Infiniband or Ethernet.
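To illustrate what shared memory buys you (again a toy sketch of my own, nothing to do with any particular package): two threads on one motherboard can hand a result to each other just by writing to ordinary RAM and setting a flag. The cache-coherency hardware makes the write visible to the other core; no explicit send or receive over any network is involved. The variable names are invented for the example.

// Toy sketch: two threads on one shared-memory machine communicating through
// an atomic flag and a plain variable, instead of explicit messages.
#include <atomic>
#include <thread>
#include <cstdio>

int               payload = 0;              // ordinary shared memory
std::atomic<bool> ready{false};             // flag that publishes the payload

void producer()
{
    payload = 42;                           // write the "message" into shared RAM
    ready.store(true, std::memory_order_release);
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire))
        ;                                   // spin until the cache-coherent flag flips
    std::printf("consumer saw %d\n", payload);
}

int main()
{
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}

In MPI the same hand-over would be an explicit MPI_Send/MPI_Recv pair, which on a single server still travels through shared memory but on a cluster has to cross Ethernet or Infiniband.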
So even if you have, or are going to write, MPI-based software which can run on a cluster, there may be an argument for not building a cluster as such, but for building a single-motherboard system with as many as 64 CPU cores.

I think the major new big academic cluster projects focus on getting as many CPU cores as possible into a single server, while minimising power consumption per unit of compute power, and then hooking as many as possible of these servers together with Infiniband.

Here is a somewhat rambling discussion of my own thoughts regarding clusters and multi-core machines, for my own purposes. My interests in high performance computing involve music synthesis and physics simulation.

There is an existing, single-threaded (written in C, can't be made multithreaded in any reasonable manner) music synthesis program called Csound. I want to use this now, but as a language for synthesis I think it is extremely clunky. So I plan to write my own program - one day . . . When I do, it will be written in C++ and multithreaded, so it will run nicely on multiple CPU-cores in a single machine. Writing and debugging a multithreaded program is more complex than doing so for a single-threaded program, but I think it will be practical, and a lot easier than writing and debugging an MPI-based program running either on multiple servers or on multiple CPU-cores on a single server.

I want to do some simulation of electromagnetic wave propagation using an existing and widely used MPI-based (C++, open source) program called Meep. This can run as a single process, if there is enough RAM, or the problem can be split up to run over multiple processes using MPI communication between them. If this is done on a single server, then the MPI communication is done really quickly, via shared memory, which is vastly faster than using Ethernet or Infiniband to other servers. However, this places a limit on the number of CPU-cores and the total memory. When simulating three-dimensional models, the RAM and CPU requirements can easily become extreme. Meep was written to split the problem into multiple zones, and to work efficiently with MPI.

For Csound, my goal is to get a single piece of music synthesised in as few hours as possible, so as to speed up the iterative cycle: alter the composition, render it, listen to it and alter the composition again. Probably the best solution to this is to buy a bleeding-edge clock-speed (3.4GHz?) 4-core Intel i7 CPU and motherboard, which are commodity consumer items for the gaming market. Since Csound is probably limited primarily by integer and floating point processing, rather than by access to large amounts of memory or by reading and writing files, I could probably render three or four projects in parallel on a 4-core i7, with each rendering nearly as fast as if just one was being rendered. However, running four projects in parallel is of only marginal benefit to me.

If I write my own music synthesis program, it will be in C++ and will be designed for multithreading via multiple CPU-cores in a shared-memory (single motherboard) environment. It would be vastly more difficult to write such a program using MPI communications. Music projects can easily consume large amounts of CPU and memory power, but I would rather concentrate on running a hot-shot single-motherboard multi-CPU shared-memory system and writing multi-threaded C++ than try to write it for MPI.
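Roughly what I have in mind for that program is the pattern below - a toy sketch only, with render_voice() standing in for the real DSP, and one independent voice per core so the threads never need to talk to each other until the final mix.

// Toy sketch of multithreaded synthesis: one voice per core, rendered into
// separate buffers (no locking needed), then mixed at the end.
#include <thread>
#include <vector>
#include <functional>
#include <cstdio>

static const size_t kSamples = 1 << 20;         // hypothetical length of the piece

static void render_voice(int voice, std::vector<float> &out)
{
    for (size_t i = 0; i < out.size(); ++i)
        out[i] = 0.001f * voice;                // placeholder for real DSP
}

int main()
{
    unsigned ncores = std::thread::hardware_concurrency();
    if (ncores == 0) ncores = 4;                // fallback if the runtime can't tell us

    std::vector<std::vector<float>> buffers(ncores, std::vector<float>(kSamples));
    std::vector<std::thread> workers;

    // One thread per core, each rendering its own voice into its own buffer.
    for (unsigned v = 0; v < ncores; ++v)
        workers.emplace_back(render_voice, static_cast<int>(v), std::ref(buffers[v]));
    for (auto &t : workers)
        t.join();

    std::vector<float> mix(kSamples, 0.0f);
    for (const auto &buf : buffers)
        for (size_t i = 0; i < kSamples; ++i)
            mix[i] += buf[i];

    std::printf("rendered and mixed %u voices\n", ncores);
    return 0;
}

The appeal is that the hard part (the DSP) stays ordinary C++, and the parallelism is just a loop that starts threads on a machine where all the cores share the one memory.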
I think any MPI-based software would be more complex and generally slower, even on a single server (MPI communications via shared memory rather than Ethernet/Infiniband), than one which used multiple threads each communicating via shared memory.

Ten or 15 years ago, the only way to get more compute power was to build a cluster and therefore to write the software to use MPI. This was because CPU devices had a single core (Intel Pentium 3 and 4) and because it was rare to find motherboards which handled multiple such chips. Now, with 4 CPU-module motherboards, it is totally different.

A starting point would be to get a 2-socket motherboard and plug 8-core Opterons into it. This can be done for less than $1000, not counting RAM and power supply. For instance, the Asus KGPE-D16 can be bought for $450, and 8-core 2.3GHz G34 Opterons can be found on eBay for $200 each. I suspect that the bleeding-edge mass-market 4-core Intel i7 CPUs are probably faster per core than these Opterons. I haven't researched this, but they have a much faster clock rate, at 3.6GHz, and the CPU design is the latest product of Intel's formidable R&D. On the other hand, the Opterons have HyperTransport and, I think, generally strong floating point performance. (I haven't found benchmarks which reasonably compare these.) I guess that many more 8 and 12 core Opterons may come on the market as people upgrade their existing systems to use the 16-core versions.

The next step would be to get a 4-socket motherboard from Tyan or SuperMicro for $800 or so and populate it with 8, 12 or (if money permits) 16 core CPUs and a bunch of ECC RAM. My forthcoming music synthesis program would run fine with 8 or 16GB of RAM, so one or two of these 16 (2 x 8) to 64 (4 x 16) core Opteron machines would do the trick nicely. This involves no cluster, HPC or fancy interconnect techniques.

Meep can very easily be limited by available RAM. A cluster solves this, in principle, since it is easy (in principle) to get an arbitrary number of servers and run a single Meep project on all of them. However, this would be slow compared to running the whole thing on multiple CPU-cores in a single server. I think the 4-socket G34 Opteron motherboards are probably the best way to get a large amount of RAM into a single server. Each CPU-module socket has its own set of DDR3 RAM, and there are four of these. The Tyan S8812:

  http://www.tyan.com/product_SKU_spec.aspx?ProductType=MB&pid=670&SKU=600000180

has 8 memory slots per CPU-module socket. Populated with 32 x 8GB ECC modules at about $80 each, this would be 256GB of memory for $2,560.

As far as I know, if the problem requires more memory than this, then I would need to use multiple servers in a cluster with MPI communications via Ethernet or Infiniband. However, if the problem can be handled by 32 to 64 CPU-cores with 256GB of RAM, then doing it in a single server as described would be much faster and generally less expensive than spreading the problem over multiple servers.

The above is a rough explanation of why the increasing number of CPU-cores per motherboard, together with the increased number of DIMM slots per motherboard and the increased affordable memory per DIMM slot, means that many projects which in the past could only be handled with a cluster can now be handled faster and less expensively with a single server. This "single powerful multi-core server" approach is particularly interesting to me for writing my own programs for music synthesis or physics simulation.
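As a sanity check on that 256GB figure, here is a back-of-the-envelope sketch (my own assumed numbers - 6 field components of 8 bytes per grid cell - not Meep's actual storage layout) of when a uniform 3-D grid outgrows a single 4-socket server:

// Back-of-the-envelope sketch: memory needed by an n x n x n grid, assuming
// 6 double-precision field components per cell, compared with 256 GiB.
#include <cstdio>

int main()
{
    const double bytes_per_cell = 6 * 8.0;              // assumed: 6 fields, 8-byte doubles
    const double server_ram     = 256.0 * (1ULL << 30); // 256 GiB in one big server

    const long sizes[] = {500, 1000, 2000, 4000};       // cells along each edge
    for (long n : sizes) {
        double cells = double(n) * n * n;
        double bytes = cells * bytes_per_cell;
        std::printf("%5ld^3 grid: %7.1f GiB  %s\n",
                    n, bytes / (1ULL << 30),
                    bytes <= server_ram ? "fits in one server"
                                        : "needs a cluster (MPI over Ethernet/Infiniband)");
    }
    return 0;
}

On those assumptions a 1000^3 grid (about 45 GiB) fits comfortably in one server, while a 2000^3 grid (about 360 GiB) would already need a cluster - which is the crossover that matters to me.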
The simplest approach would be to write the software in single-threaded C++. However, that won't make use of the CPU power, so I need to use the built-in multithreading capabilities of C++. This requires a more complex program design, but I figure I can cope with it. In principle I could write a single-threaded physics simulation program and access massive memory via a 4 CPU-module Opteron motherboard, but any such program would be performance-limited by the CPU speed, so it would make sense to split it up over multiple CPU-cores with multithreading.

For my own projects, I think this is the way forward. As far as I can see, I won't need more memory than 256GB or more CPU power than I can get on a single motherboard. If I did, then I would need to write or use MPI-based software and build a cluster - or get some time on an existing cluster.

There is another approach as well - writing software to run on GPUs. Modern mass-market graphics cards have dozens or hundreds of processing cores in what I think is a shared, very fast, memory environment. These cores are particularly good at floating point and can be programmed to do all sorts of things, but only using a specialised subset of C / C++. I want to write in full C++. However, for many people, these GPU systems are by far the cheapest form of compute power available. This raises questions of programming them, running several such GPU boards in a single server, and running clusters of such servers. The recent thread on the Titan supercomputer exemplifies this approach - get as many CPU-cores and GPU-cores as possible onto a single motherboard, in a reasonably power-efficient manner, and then wire as many as possible together with Infiniband to form a cluster.

Mass-market motherboards and graphics cards with Ethernet are arguably "Beowulf". If and when Infiniband turns up on mass-market motherboards without a huge price premium, that will be "Beowulf" too.

I probably won't want or need a cluster, but I lurk here because I find it interesting. Multi-core conventional CPUs (64-bit Intel x86 compatible, running GNU/Linux) and multiple such CPU-modules on a motherboard are chewing away at the bottom end of the cluster field. Likewise GPUs, if the problem can be handled by these extraordinary devices.

With clusters, the inter-server communication system (Infiniband or Ethernet), with its accompanying software requirements (MPI and splitting the problem over multiple servers which do not share a single memory system), is the most serious bottleneck, so the best approach seems to be to make each server as powerful as possible and then use as few of them as necessary.

 - Robin
   http://www.firstpr.com.au