I've been working on a pthread memory benchmark that is loosely modeled on McCalpin's stream. It's been quite a challenge to remove all the noise/lost performance from the benchmark to get close to the performance I expected. Some of the obstacles:

* For the compilers that tend to be better at stream (open64 and pathscale), you lose the performance if you just replace double a[],b[],c[] with double *a,*b,*c. Patch[1] available. I don't have a workaround for this, suggestions welcome. Is it really necessary for dynamic arrays to be substantially slower than static?
* You have to be very careful with pointer alignment, both with cache lines and with each other
* CPU affinity (by CPU id)
* NUMA placement (by socket id)
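For anyone who wants to poke at the alignment and affinity points, here is a minimal sketch in C. It is not the code from the benchmark or from the patch. The names alloc_aligned, pin_to_cpu and triad are made up for illustration, the 64-byte cache line is an assumption, and pthread_setaffinity_np is a Linux/glibc extension. The restrict qualifiers are one thing to try for the dynamic-array slowdown, since they tell the compiler the three arrays do not overlap; whether that recovers the lost performance on open64 or pathscale is untested here. NUMA placement by socket is left out because it needs something like libnuma.

```c
#define _GNU_SOURCE             /* for pthread_setaffinity_np and CPU_SET */
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64           /* assumed cache-line size in bytes */

/* Allocate n doubles starting on a cache-line boundary; NULL on failure. */
static double *alloc_aligned(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE, n * sizeof(double)) != 0)
        return NULL;
    return p;
}

/* Pin the calling thread to a single CPU id; returns 0 on success. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Stream-style triad on heap arrays. restrict promises the compiler
 * that a, b and c never alias each other. */
static void triad(double *restrict a, const double *restrict b,
                  const double *restrict c, double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];
}
```

Each worker thread would call pin_to_cpu with its own CPU id before it allocates and first touches its arrays, then time repeated calls to triad. Build with something like cc -O2 -pthread.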
The results are relatively smooth graphs. Here's an example; it's uselessly busy until you toggle off a few graphs (by clicking on the key):

http://cse.ucdavis.edu/bill/pstream.svg

The biggest puzzle I have now is why the previous-generation Intel quads, the current-generation AMD quads, and numerous other CPUs show a big benefit in L1, while the Nehalem shows no benefit.

[1] http://cse.ucdavis.edu/bill/stream-malloc.patch
