On Tue, Aug 15, 2006 at 12:29:02PM +0100, Kozin, I (Igor) wrote:
>
> Good point which makes perfect sense to me.
> Given that the theoretical maximum is actually 21.3 GB/s
> the real maximum Triad number must be 21.3/3 = 7.1 GB/s.
> And that's the best number I've heard of.

Then how do you explain a dual Opteron with two 6.4GB/sec (peak) memory
systems, 12.8GB/sec total per node, managing 9-10GB/sec? 12.8/3 = 4.27GB/sec.
People are seeing well over twice that. If the Opteron manages 75% efficiency
or so on a 12.8GB/sec memory system, why does Woodcrest manage only 32%
efficiency?

> Here is a pointer to some measured latencies
> http://www.anandtech.com/IT/showdoc.aspx?i=2772&p=4

Interesting, the Woodcrest latencies are much higher than I've seen
elsewhere. It's been a while since I looked at the lmbench source. I seem to
recall it used to do a negative stride, but then one of the architectures
detected it and successfully prefetched it. I'll check; if it doesn't do true
random accesses I'll post a threaded benchmark that does.

> Incidentally, the same site dwells on low latency of Core 2
> http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2795&p=5
> Anybody run stream on it?

Note the 256-byte stride; it looks like a test of the prefetcher more than
of true memory latency.

Note this link that Mark H. brought to my attention:
http://www.anandtech.com/mb/showdoc.aspx?i=2810&p=4

It shows that a mismatch between the FSB speed and twice the memory speed
hurts performance significantly. The DDR2-533 + Core 2 FSB/1066 combination
significantly outperforms DDR2-667 + Core 2 FSB/1066. If this holds true on
Woodcrest, it would seem that many of the Woodcrest systems available from
tier-1 vendors are shipping with a significantly sub-optimal memory
configuration.

It's rather counterintuitive for the faster memory (DDR2-667) to deliver only
75% of the slower memory's (DDR2-533's) performance. Based on that, one might
speculate that DDR2-667 + Core 2/Woodcrest FSB/1333 would score significantly
better, although I've yet to find a compiler, OS, and BIOS combination that
demonstrates significantly better numbers.

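The efficiency comparison earlier in this message can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper name `stream_efficiency` is made up for this example, and the inputs are the peak and observed figures quoted in this thread, not new measurements.

```python
def stream_efficiency(measured_gbs, peak_gbs):
    """Return the fraction of theoretical peak bandwidth actually delivered."""
    if peak_gbs <= 0:
        raise ValueError("peak bandwidth must be positive")
    return measured_gbs / peak_gbs


# Peak figures as quoted in the thread (GB/s).
OPTERON_PEAK = 2 * 6.4   # dual Opteron: two 6.4 GB/s memory systems per node
WOODCREST_PEAK = 21.3    # theoretical maximum quoted for Woodcrest

# The "peak / 3" Triad bound proposed in the quoted text.
opteron_third = OPTERON_PEAK / 3
woodcrest_third = WOODCREST_PEAK / 3

# Efficiency range for the 9-10 GB/s reported on dual Opteron nodes.
opteron_low = stream_efficiency(9.0, OPTERON_PEAK)
opteron_high = stream_efficiency(10.0, OPTERON_PEAK)

# Bandwidth implied by the 32% efficiency figure cited for Woodcrest.
woodcrest_implied = 0.32 * WOODCREST_PEAK

if __name__ == "__main__":
    print(f"Opteron peak/3:    {opteron_third:.2f} GB/s")
    print(f"Woodcrest peak/3:  {woodcrest_third:.2f} GB/s")
    print(f"Opteron efficiency: {opteron_low:.0%} to {opteron_high:.0%}")
    print(f"Woodcrest at 32%:  {woodcrest_implied:.1f} GB/s")
```

With these inputs the Opteron range works out to roughly 70% to 78% of peak, which is where the "75% or so" figure comes from, and 32% of 21.3GB/sec is about 6.8GB/sec, just under the 7.1GB/sec bound quoted above.
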
Offline I've had reports from people who:
* Have the FSB snoop filter off by default in BIOS
* Have the adjacent cache line prefetch on (which would likely increase main
  memory latency)
* Have DIMMs in the wrong slots (4 DIMMs on 2 channels, not 4 on 4 channels).

I've also seen Intel documents on the chipset showing that stream numbers
increase with the number of DIMMs and ranks: 4 single-rank DIMMs get only 65%
of the possible stream bandwidth, 8 single-rank DIMMs get 80%, and 8
double-rank DIMMs get 100%. Alas, no absolute numbers.

It seems a little strange to get only 65% of possible stream bandwidth with
4 DIMMs; after all, their peak bandwidth is 21GB/sec. Maybe FB-DIMMs and/or
the current chipset allow only a few pages open per bank/rank? So it takes 16
banks/ranks to allow for good stream performance (read that as allowing the
prefetcher to hide 120ns or so of main memory latency).

I'm guessing the best Woodcrest stream numbers will come from:
* Pathscale's compiler with -mp -O3 or possibly -mp -Ofast
* 8 dual-rank DIMMs
* FSB snoop filter on (in BIOS)
* Prefetch adjacent cache lines (in BIOS)

-- 
Bill Broadley
Computational Science and Engineering
UC Davis
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf