See this paper: http://synergy.cs.vt.edu/pubs/papers/daga-saahpc11-apu-efficacy.pdf
While discrete GPUs underperform an APU on host-to/from-device transfers by a ratio of roughly 2X, they compensate by far with roughly 8-10X the compute power and local memory bandwidth. You can, of course, cook up a test that does little computation and is entirely bound by the host-to/from-device transfers. Programming-wise there is no difference: there is no coherence yet, so explicit transfers through API calls are needed (a minimal sketch is appended at the end of this message).

Joshua

------ Original Message ------
Received: 04:06 PM CDT, 03/10/2013
From: Vincent Diepeveen <d...@xs4all.nl>
To: Mark Hahn <h...@mcmaster.ca>
Cc: Beowulf List <beowulf@beowulf.org>
Subject: Re: [Beowulf] difference between accelerators and co-processors

>
> On Mar 10, 2013, at 9:03 PM, Mark Hahn wrote:
>
> >> Is there any line/point to make a distinction between accelerators and
> >> co-processors (that are used in conjunction with the primary CPU to boost
> >> performance)? Or can these terms be used interchangeably?
> >
> > IMO, a coprocessor executes the same instruction stream as the
> > "primary" processor. this was the case with the x87, for instance,
> > though the distinction became less significant once the x87 came onchip.
> > (though you certainly notice that the FPU on any of these chips is mostly
> > separate - not sharing functional units or register files, sometimes even
> > with separate micro-op schedulers.)
> >
> >> Specifically, the word "accelerator" is used commonly with GPUs. On the
> >> other hand, the word "co-processor" is used commonly with Xeon Phi.
> >
> > I don't think it is a useful distinction: both are basically independent
> > computers. obviously, the programming model of Phi is dramatically more
> > like a conventional processor than Nvidia's.
>
> Mark, that's the marketing talk about Xeon Phi.
>
> It's surprisingly the same of course, except for the cache coherency;
> big vector processors.
>
> > there is a meaningful distinction between offload and coprocessor approaches.
> > that is, offload means you use the device to accelerate a set of libraries
> > (offload matrix multiply, eig, fft, etc). to use a coprocessor, I think the
> > expectation is that the main code will be very much aware of the state of
> > the PCIe-attached hardware.
> >
> > I suppose one might suggest that "accelerator" to some extent implies
> > offload usage: you're accelerating a library.
> >
> > another interesting example is AMD's upcoming HSA concept: since nearly all
> > GPUs are now on-chip, AMD wants to integrate the CPU and GPU programming
> > models (at least to some extent). as far as I understand it, HSA is based
> > on introducing a quite general intermediate ISA that can be executed using
> > all available hardware resources: CPU and/or GPU. although Nvidia does have
> > its own intermediate ISA, they don't seem to be trying to make it general,
> > *and* they don't seem interested in making it work on both C/GPU. (well,
> > so far at least - I wouldn't be surprised if they _did_ have a PTX JIT for
> > their ARM-based C/GPU chips...)
> >
> > I think HSA is potentially interesting for HPC, too. I really expect
> > AMD and/or Intel to ship products this year that have a C/GPU chip mounted
> > on the same interposer as some high-bandwidth ram.
>
> How can an integrated gpu outperform a gpgpu card?
>
> Something like 25 watts versus 250 watts: which will be faster?
>
> I assume you will not build 10 nodes with 10 CPUs with integrated GPUs
> in order to rival a single card.
>
> > a fixed amount of very high performance memory sounds very tasty to me.
> > a surprising amount of power in current systems is spent getting
> > high-speed signals off-socket.
> >
> > imagine a package dissipating say 40W containing, say, 4 CPU cores,
> > 256 GPU ALUs and 2GB of gddr5. the point would be to tile 32 of them
> > in a 1U box. (dropping socketed, off-package dram would probably make
> > it uninteresting for memcached and some space-intensive HPC.)
> >
> > then again, if you think carefully about the numbers, any code today
> > that has a big working set is almost as anachronistic as code that uses
> > disk-based algorithms. (same conceptual thing happening: capacity is
> > growing much faster than the pipe.)
> >
> > regards, mark hahn.

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
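P.S. To make the "explicit transfers through API calls" point concrete, here is a minimal sketch. It assumes CUDA on a discrete card purely for illustration; the same pattern (allocate a device buffer, copy in, launch, copy out) applies to OpenCL buffers on an APU. Names and sizes are made up for the example.

// Minimal sketch, illustration only: CUDA on a discrete GPU is assumed here;
// the same explicit-copy pattern applies to OpenCL on an APU, since neither
// exposes coherent shared memory to the programmer yet.
// Error checking omitted for brevity.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void scale(float *v, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] *= a;                  /* trivial compute: ~1 flop per element */
}

int main(void)
{
    const int n = 1 << 22;                  /* 4M floats, 16 MB each way */
    size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i)
        h[i] = 1.0f;

    float *d;
    cudaMalloc((void **)&d, bytes);

    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  /* explicit host -> device */
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);      /* tiny amount of compute  */
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  /* explicit device -> host */

    printf("h[0] = %f\n", h[0]);            /* expect 2.0 */
    cudaFree(d);
    free(h);
    return 0;
}

With a kernel this trivial, nearly all of the run time is the two copies over PCIe, which is exactly the transfer-bound case described above; on an APU the copies are cheaper, but the kernel then has far less compute power and local bandwidth to draw on, which is the trade-off the paper linked at the top measures.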