>> absent (along with AMD :( ).  it's not clear how generally applicable
>> Cuda's SIMT programming model is.  and having it as a separate ISA
>> (versus traditional cores) is a problem, complexity-wise.
>
> In general CUDA seems to hide the hardware poorly, so
> it is probably doomed long-term.

well, cuda exposes a quite different programming model; I'm not sure
they can do much about the generational differences they expose (say,
differences in which atomic operations the hardware supports).  many
of the exposures are primarily tuning knobs - warp width, number of
SMs, cache sizes, ratio of DP units per thread - see the sketch below.
a very high-level interface like OpenACC doesn't expose that stuff -
is that more what you're looking for?  there's no denying that you
get far less expressive power...
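for concreteness, here's a minimal sketch of what "tuning knobs" means
in practice - just an illustration, assuming a working CUDA toolkit
(compile with nvcc); the fields come straight from cudaDeviceProp:

  /* print the per-device knobs the cuda runtime exposes */
  #include <stdio.h>
  #include <cuda_runtime.h>

  int main(void)
  {
      int n = 0;
      cudaGetDeviceCount(&n);
      for (int d = 0; d < n; d++) {
          cudaDeviceProp p;
          cudaGetDeviceProperties(&p, d);
          printf("%s: cc %d.%d, %d SMs, warp %d, L2 %d KB\n",
                 p.name, p.major, p.minor, p.multiProcessorCount,
                 p.warpSize, p.l2CacheSize >> 10);
      }
      return 0;
  }

tuned kernels wind up branching on exactly these numbers, which is
the generational exposure I mean; OpenACC hides all of it behind the
compiler.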
>> stacking is great, but not that much different from MCMs, is it?
>
> Real memory stacking a la TSV has smaller geometries, way more
> wire density, lower power burn, and seems to boost memory bandwidth
> by one order of magnitude

sorry, do you have a reference for this?  what I'm reading is that
TSV and chip-on-chip stacking is fine, but not dramatically different
from chip bumps (possibly using TSV) connecting to interposer boards.
obviously, attaching chips to fine, tiny, low-impedance, wide-bus
interposers gives a lot of flexibility in designing packages.

> http://nepp.nasa.gov/workshops/etw2012/talks/Tuesday/T08_Dillon_Through_Silicon_Via.pdf

that's useful, thanks.  it's a bit high-end-centric - no offence, but
NASA and high-volume mass production are not entirely aligned ;)  it
paints 2.5d as quite ersatz, but I didn't see a strong data argument.
sure, TSVs will operate on a finer pitch than solder bumps, but the
xilinx silicon interposer also seems very attractive.  do you actually
get significant power/speed benefits from pure chip-chip contacts
versus an interposer?  I'd guess not: the main win is staying
in-package.

it is interesting to think, though: if you can connect chips with
extremely wide links, does that change your architecture?  for
instance, dram is structured as a 2d array of bit cells that is read
out, a row at a time, into a 1d slice (iirc, something like 8 kbits).
cpu r/w requests that hit within this slice are satisfied faster,
since it's the readout from the 2d array that's expensive.  but
suppose a readout pumped the whole 8 kbits to the cpu - sort of a
cache line 16x longer than usual.  considering the proliferation of
128-512-bit-wide SIMD units, maybe this makes perfect sense.  this
would also let you keep vector fetches from flushing all the
non-vector stuff out of your normal short-line caches...
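to make the 16x concrete (a back-of-envelope sketch; the 8 kbit row
width is my assumption from above, real parts vary):

  /* a full dram row readout vs. an ordinary cache line.
     sizes here are assumptions, not datasheet values. */
  #include <stdio.h>

  int main(void)
  {
      const int row_bits  = 8 * 1024;      /* assumed row width       */
      const int row_bytes = row_bits / 8;  /* 1024 B per readout      */
      const int line      = 64;            /* typical cache line, B   */
      printf("row = %d B = %dx a %d B line\n",
             row_bytes, row_bytes / line, line);  /* 1024 B = 16x */
      return 0;
  }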
