On Apr 4, 2011, at 6:54 PM, Massimiliano Fatica wrote:

> If you are old enough to remember the time when the first distributed
> computers appeared on the scene, this is a deja-vu. Developers used to
> programming on shared memory (mostly with directives) were complaining
> about the new programming models (PVM, MPL, MPI).
> Even today, if you have a serial code, there is no tool that will make
> your code run on a cluster. Even on a single system, if you try an
> auto-parallel/auto-vectorizing compiler on a real code, your results
> will probably be disappointing.
>
> When you can get a 10x boost on a production code by rewriting some
> portions of your code to use the GPU, if time to solution is important
Oh come on, a factor of 10 is not realistic. You're making the usual
comparison here: a hobby coder who wrote a bit of C or slowish C++
(except for a single NCSA coder, I have yet to find a C++ guy who can
write code as fast as C for complex algorithms; granted, for big
companies C++ makes more sense, just not when performance is the point)
gets compared against a fully sponsored CUDA project that uses the
top-end GPU, and the GPU is measured against a single core instead of
four sockets (which is the power-equivalent comparison). Moneywise it is
of course another story; that's where the GPUs win big. Yet there is a
hidden cost to GPUs: you can build something far faster for less money,
but you also need to pay a good coder to write your code in either CUDA
or AMD-CAL (or, as the Chinese seem to do, in both at the same time,
which is not so complicated if you have set things up correctly). This
last point is a big problem for the western world: governments pay big
bucks for hardware, but they seem to forget to pay good coders what they
are worth. Secondly, there is another problem: NVIDIA hasn't even
released the instruction set of their GPUs. Try figuring that out
without working on it full time. It seems pretty similar to AMD's,
though; despite the huge architectural differences between the two, the
programming similarity is striking and makes the real design purpose
self-evident (graphics).

> or you could perform simulations that were impossible before ( for
> example using algorithms that were just too slow on CPUs,

All true, yet it takes a LOT OF TIME to write something that's fast on a
GPU. First of all you have to avoid writing double-precision code, as
the gamer cards from NVIDIA don't seem to carry much double-precision
logic; they only have 32-bit units. So at double precision, AMD is
something like 10 times better than NVIDIA in GFLOPS per dollar. Yet try
to figure that out without being full-time busy with these GPUs. Only
the Tesla versions seem to have those transistors. Secondly, NVIDIA
keeps pushing the clock frequency of the GPU. That might be great for
games, just as highly clocked cores work for Intel, yet for throughput
it is of course a dead end. In raw throughput AMD's (ATI's) approach
will always beat NVIDIA's, as clocking a processor higher has roughly a
cubic, O(n^3), impact on power consumption (dynamic power scales with
C * V^2 * f, and the voltage has to rise roughly in step with the
frequency, hence the cubic behaviour). Now a big problem with NVIDIA is
also that they basically go over spec. I haven't fully pinned it down,
yet it seems PCI-e was designed with 300 watts as the maximum. Yet the
code I'm busy with, the CUDA version of it (mfaktc), consumes a whopping
400+ watts, and please realize that the majority of the runtime is spent
just keeping the stream cores busy, hardly touching the caches or RAM at
all. It is only doing multiplications at full speed in 32-bit code,
using the new Fermi instruction that multiplies 32 bits x 32 bits into a
64-bit result (a short sketch follows below). The CUDA version of this
code is, by the way, developed by a guy working for an HPC vendor which,
I guess, also sells those Teslas. So any performance bragging must keep
in mind that it runs well over 33% above spec in power consumption. Note
that AMD seems to be following NVIDIA down that path.

> Discontinuous Galerkin method is a perfect example), there are a lot
> of developers that will write the code.

Oh come on, writing for GPUs is really complicated.
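For reference, the 32 x 32 -> 64 bit multiply mentioned above looks
roughly like this in CUDA. This is a minimal sketch of my own, not code
from mfaktc: on Fermi the cast to a 64-bit product compiles to the wide
multiply, and __umulhi() gives just the high 32 bits if that is all you
need.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread forms the full 64-bit product of two 32-bit operands.
    // On Fermi the (unsigned long long) cast maps to the wide 32x32->64
    // multiply; __umulhi(a, b) would return only the high 32 bits.
    __global__ void mul32x32(const unsigned int *a, const unsigned int *b,
                             unsigned long long *prod, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            prod[i] = (unsigned long long)a[i] * b[i];
    }

    int main(void)
    {
        const int n = 256;
        unsigned int ha[n], hb[n];
        unsigned long long hp[n];
        for (int i = 0; i < n; ++i) {
            ha[i] = 0xDEADBEEFu + i;
            hb[i] = 0xCAFEBABEu - i;
        }

        unsigned int *da, *db;
        unsigned long long *dp;
        cudaMalloc(&da, n * sizeof *da);
        cudaMalloc(&db, n * sizeof *db);
        cudaMalloc(&dp, n * sizeof *dp);
        cudaMemcpy(da, ha, n * sizeof *da, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, n * sizeof *db, cudaMemcpyHostToDevice);

        mul32x32<<<(n + 127) / 128, 128>>>(da, db, dp, n);
        cudaMemcpy(hp, dp, n * sizeof *hp, cudaMemcpyDeviceToHost);

        printf("%u * %u = %llu\n", ha[0], hb[0], hp[0]);
        cudaFree(da); cudaFree(db); cudaFree(dp);
        return 0;
    }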
> The effort is clearly dependent on the code, the programmer and the
> tool used (you can go from fully custom GPU code with CUDA or OpenCL,

Forget OpenCL, it is not good enough. Better to write the same thing in
CUDA and AMD-CAL at the same time.

> to automatically generated CUF kernels from PGI, to directives using
> HMPP or PGI Accelerator).
> In situations where time to solution relates to money, for example
> oil and gas, GPUs are the answer today (you will be surprised
> by the number of GPUs in Houston).

Pardon me, those industries were already using vectorized solutions long
before CUDA existed, and of course they started using GPUs massively for
calculation as soon as NVIDIA released a programmable version. This is
not new. And none of those industries will ever say anything about the
performance they get, nor how many GPUs they use.

> Look at the performance and scaling of AMBER (MPI + CUDA),
> http://ambermd.org/gpus/benchmarks.htm, and tell me that the results
> were not worth the effort.
>
> Is GPU programming for everyone: probably not, in the same measure
> that parallel programming is not for everyone.
> Better tools will lower the threshold, but a threshold will always
> be present.

I would argue that both AMD and NVIDIA have in effect handed the
third-world nations an advantage by stopping progress in the rich
nations. I will explain. The real big advantage of rich nations is that
average people have more cash. Students are a good example: they can
easily afford GPUs. Yet there is so little technical information
available on latencies, and in NVIDIA's case on the instruction set the
GPUs support, that it puts a huge programming hurdle in front of
students. Nor do NVIDIA's documents contain good tips on how to program
for these things. The most fundamental lessons on how to program a GPU
are missing from every document I have scanned so far; it's just a bunch
of 'lectures' that are never going to create any top coders, a piece of
information here and a bit there. Very bad.

AMD is also a nightmare there: their GPUs cannot even run more than one
program at the same time, despite claims that the 4000 series already
had hardware support for it. The support team is so sloppy that they did
not even rename the word 'ati' in the documentation to AMD, and the
library gets a new name every few months; what was Stream SDK now
carries yet another fancy name. Five years later, much of it still does
not work. For example, in OpenCL the second GPU does not work either in
AMD's case; the result is "undefined". Nice. A default driver install on
Linux here does not even get OpenCL working on the 6970 (a minimal
device-enumeration check is sketched below). Both NVIDIA and AMD are a
total joke there, and I politely call it incompetence: the generic
incompetence of not providing complete and clear documentation, the way
we have documentation on how CPUs work, be it from Intel or AMD or IBM.

Students who program for those GPUs now, in CUDA or AMD-CAL, have to go
to hell and back to get anything beyond trivial code working well on
them. We see that only a few manage. That is not a problem of the
students, but a problem for society, because doing calculations faster,
and especially cheaper, is a huge advantage for science. NSA-type
organisations in third-world nations are a lot bigger than here, simply
because more people live there. So right now more people over there code
for GPUs than here, here where everyone can afford one. Some big
companies excepted, of course, but this is not a note about companies.
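On the "second GPU is undefined" complaint above: the first sanity check
on any multi-GPU box is plain device enumeration. Here is a minimal
sketch of what that looks like with the standard CUDA runtime API (the
OpenCL equivalent walks clGetPlatformIDs/clGetDeviceIDs instead); if a
physically present second card never shows up here, no kernel will ever
run on it.

    #include <cstdio>
    #include <cuda_runtime.h>

    // List every GPU the runtime can see, with name and compute
    // capability, so a missing second card is caught before any real work.
    int main(void)
    {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceCount: %s\n",
                    cudaGetErrorString(err));
            return 1;
        }
        printf("%d CUDA device(s) visible\n", count);
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("device %d: %s (compute %d.%d)\n",
                   i, prop.name, prop.major, prop.minor);
        }
        return 0;
    }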
This is a note on the first world versus the third world. The real
difference is students with budget over here: they have the money for
GPUs, yet there is no good documentation simply listing which
instructions a GPU has, let alone their latencies. If you google hard,
you will find one guy who actually had to measure the latencies of
simple instructions that write to the same register. Why did a
university guy need to measure this; why isn't it simply in NVIDIA's
documentation? (The measuring trick itself is simple enough; a sketch
follows below.)

A few gaps like that are enough to make the vast majority of students
who try something on a GPU fail completely. Because they fail, they do
not continue, and they never get back from those GPUs the thing they
were after: faster calculation speed for whatever they wanted to run.
This is where AMD and NVIDIA, and I politely call it incompetence, give
the rich nations no advantage over the third-world nations: students
have to work full time acquiring knowledge about the internal workings
of the GPUs in order to get anything running fast on them. The majority
fail, which has simply kept GPUs from being massively adopted. I have
seen so many students try and fail at GPU programming, especially CUDA.
It's bizarre; the failure rate is huge. Even a big success does not get
recognized as a big success, simply because the guy did not know about a
few bottlenecks in GPU programming, as no manual described the
combination of problems he ran into and no technical data was available.

It is true that GPUs can be fast, but I feel there is a big need for
better technical documentation of them. We can no longer ignore this now
that third-world nations are overrunning first-world nations, mainly
because the sneaky organisations that do know everything are bigger over
there than here, by sheer population size. The huge advantage of the
rich nations, namely that every student has such a GPU at home, goes
unused because the hurdle to GPU programming is too high for lack of
accurate documentation. In third-world nations, by contrast, most
students have at most a mobile phone and very seldom a laptop (except
for the rich elite), let alone a computer with a capable programmable
GPU, which makes GPU computation impossible for the majority of their
students simply for lack of cash.
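The measuring trick referred to above takes only a few lines of CUDA
once you know it: issue a long chain of instructions in which each one
depends on the previous result in the same register, time the chain with
the on-chip cycle counter, and divide. A minimal sketch, assuming a
Fermi-class card (clock64() needs compute capability 2.0, and the chain
length of 256 is an arbitrary choice of mine):

    #include <cstdio>
    #include <cuda_runtime.h>

    #define CHAIN 256   // number of dependent multiply-adds to time

    // One thread runs a chain of multiply-adds in which every operation
    // depends on the previous result, so elapsed clocks divided by the
    // chain length approximate the per-instruction latency in cycles.
    __global__ void mul_latency(unsigned int seed, unsigned int *out,
                                long long *cycles)
    {
        unsigned int x = seed;
        long long t0 = clock64();
    #pragma unroll
        for (int i = 0; i < CHAIN; ++i)
            x = x * x + i;      // each step reads and rewrites x
        long long t1 = clock64();
        *out = x;               // keep the chain from being optimized away
        *cycles = t1 - t0;
    }

    int main(void)
    {
        unsigned int *d_out;
        long long *d_cycles;
        cudaMalloc(&d_out, sizeof *d_out);
        cudaMalloc(&d_cycles, sizeof *d_cycles);

        mul_latency<<<1, 1>>>(12345u, d_out, d_cycles);

        long long cycles = 0;
        cudaMemcpy(&cycles, d_cycles, sizeof cycles,
                   cudaMemcpyDeviceToHost);
        printf("~%.1f cycles per dependent 32-bit multiply-add\n",
               (double)cycles / CHAIN);

        cudaFree(d_out);
        cudaFree(d_cycles);
        return 0;
    }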
> Massimiliano
> PS: Full disclosure, I work at Nvidia on CUDA (CUDA Fortran,
> applications porting with CUDA, MPI+CUDA).
>
> 2011/4/4 "C. Bergström" <[email protected]>:
>> Herbert Fruchtl wrote:
>>> They hear great success stories (which in reality are often prototype
>>> implementations that do one carefully chosen benchmark well), then
>>> look at the API, look at their existing code, and postpone the start
>>> of their project until they have six months of spare time for it. And
>>> we know when that is.
>>>
>>> The current approach with more or less vendor-specific libraries (be
>>> they "open" or not) limits the uptake of GPU computing to a few
>>> hardcore developers of experimental codes who don't mind rewriting
>>> their code every two years. It won't become mainstream until we have
>>> a compiler that turns standard Fortran (or C++, if it has to be) into
>>> GPU code. Anything that requires more change than, let's say, OpenMP
>>> directives is doomed, and rightly so.
>>
>> Hi Herbert,
>>
>> I think your perspective pretty much nails it.
>>
>> (shameless self promotion)
>> http://www.pathscale.com/ENZO (PathScale HMPP - native codegen)
>> http://www.pathscale.com/pdf/PathScale-ENZO-1.0-UserGuide.pdf
>> http://www.caps-entreprise.com/hmpp.html (CAPS HMPP - source to source)
>>
>> This is really only the tip of the problem, and there must also be
>> solutions for scaling *efficiently* across the cluster. (No, MPI + CUDA
>> or even HMPP is *not* the answer imho.)
>>
>> ./C

_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
