For those interested in latency: I wrote a pthread-based latency tester that accesses N integers randomly per thread, with each member of the array loaded exactly once. All the numbers below are for N = 1,000,000 integers.
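For anyone who wants to reproduce something similar, here is a minimal sketch of this kind of tester. This is not the original program: the pointer-chasing walk over a random single-cycle permutation (Sattolo's algorithm), the thread count, the struct and function names, and the use of clock_gettime/rand_r are all my assumptions about one reasonable way to implement it.

/* Minimal sketch (not the original code): each thread chases a random
 * single-cycle permutation of N integers, so every element is loaded
 * exactly once and each load depends on the previous one. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N        1000000
#define NTHREADS 4

struct result {
    unsigned seed;          /* per-thread RNG seed */
    double   start, finish; /* wall-clock times in ns */
    size_t   sink;          /* defeats dead-code elimination */
};

static double now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

static void *chase(void *arg)
{
    struct result *r = arg;
    size_t *a = malloc(N * sizeof *a);

    /* Sattolo's algorithm: a random permutation that is a single cycle,
       so chasing a[idx] visits every element exactly once. */
    for (size_t i = 0; i < N; i++)
        a[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = rand_r(&r->seed) % i;   /* j < i keeps it one cycle */
        size_t t = a[i]; a[i] = a[j]; a[j] = t;
    }

    /* Each load depends on the previous one, so the loop time is
       dominated by memory latency, not bandwidth. */
    size_t idx = 0;
    r->start = now_ns();
    for (size_t i = 0; i < N; i++)
        idx = a[idx];
    r->finish = now_ns();
    r->sink = idx;

    free(a);
    return NULL;
}

int main(void)
{
    pthread_t     tid[NTHREADS];
    struct result res[NTHREADS];

    for (int t = 0; t < NTHREADS; t++) {
        res[t].seed = t + 1;
        pthread_create(&tid[t], NULL, chase, &res[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    double min_start = res[0].start, max_finish = res[0].finish;
    for (int t = 0; t < NTHREADS; t++) {
        printf("thread %d: %.2f ns/access\n",
               t, (res[t].finish - res[t].start) / N);
        if (res[t].start  < min_start)  min_start  = res[t].start;
        if (res[t].finish > max_finish) max_finish = res[t].finish;
    }
    /* "effective" ns: total wall time over total accesses */
    printf("effective: %.2f ns/access\n",
           (max_finish - min_start) / ((double)N * NTHREADS));
    return 0;
}

Compile with something like "gcc -O2 -pthread latency.c -o latency". The per-thread figure corresponds to the first number in the table below, the effective figure to the second.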
The first number is the latency per thread, so it increases with memory contention. The second number is the "effective" ns, where I take the run time[1] across all threads and divide it by the total number of integers retrieved; it should decrease with more threads if the machine has enough CPU and memory-system parallelism to avoid contention.

                                 1 thread            2 threads          4 threads
Dual Opteron 275[2]              83.69ns/83.69ns     80ns/52.08ns       85ns/21.72ns
Quad Opteron 846[3]              108.07ns/108.07ns   115ns/61.39ns      110ns/27.89ns
Dual Woodcrest-2.66[2]           107.18ns/107.18ns   108ns/54.03ns      118ns/29.69ns
Dual core AMD64-2.2GHz[5]        89.45ns/89.45ns     89.45ns/44.72ns    145ns/52.76ns
AMD64 3200[4]-2.0GHz             69.74ns/69.74ns     69ns/69.31ns       137ns/69.85ns
Dual socket Nocona 3.4GHz[6]     130.45ns/130.45ns   133ns/66.72ns      230ns/67.72ns
Dual core P4-3.0[6]              115.45ns/115.46ns   185ns/101.03ns     283ns/92.67ns
Dual Itanium2-1.4GHz[6]          200.47ns/200.47ns   203ns/101.92ns     362ns/101.57ns

I'm happy to say that Pathscale, Intel, GCC-3, and GCC-4 all deliver mostly identical performance, although I had to be very careful with Pathscale to keep the benchmark routine from being optimized away.

Anyone have a Rev F Opteron handy?

[1] Where runtime = max(finish times) - min(start times)
[2] Dual socket, dual core = 4 cores
[3] Quad socket, single core = 4 cores
[4] Single socket, single core = 1 core
[5] Single socket, dual core = 2 cores
[6] Dual socket, single core = 2 cores

--
Bill Broadley
Computational Science and Engineering
UC Davis