Mark,

Thanks; that led me (with a bit of wandering) to e.g. http://www.cs.virginia.edu/stream/top20/Balance.html.

My immediate concern is an app that is worse than embarrassingly parallel: it can't (currently) trade memory for time, and by the list's standards it can't really use memory or network effectively. Basically I want a zillion CPUs, and they can communicate by crayon on postcard. That's not practical, so my initial valuator is just GHz/$. I care about the memory-sharing and message-passing efficiency issues only insofar as I want to smarten up my app to take advantage of other economies.

Peter
On 3/8/07, Mark Hahn <[EMAIL PROTECTED]> wrote:
> Great thanks. That was clear, and the takeaway is that I should pay
> attention to the number of memory channels per core (which may be less
> than 1.0).

I think the takeaway is a bit more acute: if your code is cache-friendly, simply pay attention to cores * clock * flops/cycle. Otherwise (i.e., when your models are large), pay attention to the "balance" between observed memory bandwidth and peak flops. The STREAM benchmark is a great way to measure this, and has traditionally promulgated the "balance" argument. Here's an example:

http://www.cs.virginia.edu/stream/stream_mail/2007/0001.html

Basically, that's 13 GB/s for a 2-socket, dual-core Opteron/2.8 GHz system (peak flops would be 2*2*2*2.8 = 22.4 GFlops), so you need about 1.7 flops per byte to be happy. I don't have a report handy for Core2, but IIRC people report hitting a wall of around 9 GB/s for any dual-FSB Core2 system. Assuming dual-core parts like the paper, peak theoretical flops is 37 GFlops, for a balance of just over 4. That ratio should really be called "imbalance" ;) Quad-core would be worse, of course.
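The arithmetic above can be sketched out explicitly. This is just a back-of-the-envelope calculator using the numbers quoted in the thread (the function names are mine, not from any benchmark suite); "balance" here means flops of compute the machine can do per byte STREAM can move, so higher means more bandwidth-starved:

```python
def peak_gflops(sockets, cores_per_socket, flops_per_cycle, ghz):
    # Theoretical peak: every core issuing its max flops every cycle.
    return sockets * cores_per_socket * flops_per_cycle * ghz

def balance(gflops, stream_gb_per_s):
    # Flops per byte of sustained memory traffic.
    return gflops / stream_gb_per_s

# 2-socket dual-core Opteron at 2.8 GHz, 2 flops/cycle, ~13 GB/s STREAM:
opteron = balance(peak_gflops(2, 2, 2, 2.8), 13.0)

# Dual-FSB dual-core Core2: ~37 GFlops peak, ~9 GB/s bandwidth wall:
core2 = balance(37.0, 9.0)

print(f"Opteron balance: {opteron:.2f} flops/byte")  # ~1.72
print(f"Core2 balance:   {core2:.2f} flops/byte")    # ~4.11
```

A code whose inner loop does fewer flops per byte than these ratios will sit idle waiting on memory, which is why the Core2 number looks worse despite the higher peak.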
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf