Mark Hahn wrote:
>>> The same site reports that the X6800, a 2.93 GHz Core 2, sees
>>> almost 12.5 SP GFLOPS using ScienceMark 2.0 (6.2 DP GFLOPS).
>
> hmm, those numbers are pretty low - peak should be 2.93*4 or 8,
> and I'd expect 80% of peak or 19 Gflops/core for this comparison
> (Opterons can do 90%, at least on my machine using HPL.)

I've consulted some other sources just to make sure I get this right. We can't naively say that Core 2 maxes out at clock*4 or clock*8 for theoretical peak flops. Port 1 on the FPU can handle 4xSP flops, but only simple operations like FPADD. Port 0 can handle FPMUL and FPDIV on a 4xSP vector.
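A quick sketch of that floor/ceiling arithmetic, under my own simplified assumptions (two SSE ports, each retiring 4 SP flops per cycle, one dedicated to adds and one to multiplies/divides; the function below is illustrative, not from the thread):

```python
def peak_sp_flops_per_cycle(add_fraction):
    """Peak SP flops/cycle for an instruction mix with the given FPADD share.

    With two ports each 4 flops wide, throughput is limited by the busier
    port: pure FPMUL (or pure FPADD) leaves one port idle and yields
    4 flops/cycle, while a 50/50 add/mul mix keeps both full for 8.
    """
    return 4.0 / max(add_fraction, 1.0 - add_fraction)

clock_ghz = 2.93  # X6800
floor = clock_ghz * peak_sp_flops_per_cycle(0.0)    # pure FPMUL: ~11.7 GFLOPS
ceiling = clock_ghz * peak_sp_flops_per_cycle(0.5)  # balanced mix: ~23.4 GFLOPS
print(floor, ceiling)
```

On this model the 12.5 GFLOPS ScienceMark figure sits just above the multiply-only floor, not near the 23.4 GFLOPS ceiling.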
So, there is a hard floor on theoretical Core 2 floating point performance of clock*4 flops (for pure FPMUL and FPDIV), and a hard ceiling of clock*8 flops (for a mix where FPADD is >=50%). Looking at the source code, SGEMM is an FPMUL bruiser, which puts peak performance closer to the floor than the ceiling. 12.5 GFLOPS looks like an accurate number for Core 2 SGEMM.

> so the paper shows 80.6 Gflops SGEMM for 8 SPE's; it's only fair to
> compare this to 2 or 4 Core2 cores (37.5 and 75 Gflops!)

Going by die size, Cell would compare with a hypothetical three-core Core 2 CPU. (Cell is apparently ~220 mm^2, Core 2 Duo ~140 mm^2.)

>> indicative of per core performance on Core 2. Is it safe to say that
>> Core 2 achieves <15 gflops/core at 3ghz, assuming ~15% premium with Goto
>> BLAS?
>
> peak SGEMM/core would be 3*8=24, so 15 sounds quite low.
>
>>> It looks like a preproduction 2.4 GHz Cell is 2-6 times faster than a
>
> do you know of something crippled in the pre-production Cell chips?

Clock speed?

<snip>

> I don't think there's anything too dubious about 80% of theoretical for
> Core2. but I also didn't think the Sequoia stuff was such a cheap hack
> as you imply (not to put words into your mouth ;)

If we are to believe what's in the LBL paper, IBM is getting ~200 GFLOPS peak on SGEMM with full-clock Cell engineering samples, which means peak should be ~150 GFLOPS on the chip in the article. The 52% of peak achieved by Sequoia is probably a little low.

>> like to see a benchmark comparison of SGEMM (and DGEMM) using Core
>> 2-optimized BLAS vs. Cell-optimized BLAS, thereby making a useful
>> conclusion about how interesting Cell is for HPC.
>
> actually, Sequoia seems precisely like the structure you need to make Cell
> work, since its whole purpose is to express the rather constrained way
> that memory is used in Cell. the paper is actually pretty clear on
> where the Cell spends its time, and for SGEMM, it's executing the
> "leaf" code, which is IBM's Cell library.
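The Cell figures quoted above are consistent with a simple peak model. My assumptions here: 8 SPEs, each sustaining 8 SP flops per cycle (a 4-wide multiply-add), and a 3.2 GHz production clock; the function name is mine.

```python
def cell_peak_sp_gflops(clock_ghz, n_spes=8, flops_per_spe_cycle=8):
    """Theoretical peak SP GFLOPS for Cell's SPEs (PPE ignored)."""
    return clock_ghz * n_spes * flops_per_spe_cycle

full_clock = cell_peak_sp_gflops(3.2)  # 204.8 -> the "~200 gflops" samples
prototype = cell_peak_sp_gflops(2.4)   # 153.6 -> "~150 gflops" for this chip
sequoia_eff = 80.6 / prototype         # ~0.52 -> the 52% of peak above
print(full_clock, prototype, sequoia_eff)
```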
Its whole purpose is to express distributed computers with arbitrary memory topologies, from SMP (NUMA and non-NUMA) to clusters. It actually looks really cool.

> I guess the prototype might be really bad, or Sequoia might be broken in a
> way not hinted at in the paper, or IBM's Cell intrinsic library could be
> terrible. but the paper seems on the up-and-up, and the scaling curves and
> leaf-vs-communication figures surely make Cell look underwhelming,
> at least if you assume, as I do, that it has to deliver a large speedup
> to be worth investing in...

The paper is more of a statement on the capabilities of the Sequoia compiler than of the Cell processor. I don't think it's unreasonable to assume their SGEMM implementation was written for clarity rather than speed.

--
Geoffrey D. Jacobs
Go to the Chinese Restaurant,
Order the Special
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf