On Tuesday 15 August 2006 17:25, Richard Walsh wrote:
> Mark Hahn wrote:
> >>> Good point which makes perfect sense to me.
> >>> Given that the theoretical maximum is actually 21.3 GB/s
> >>> the real maximum Triad number must be 21.3/3 = 7.1 GB/s.
> >
> > I don't get this - triad does two reads and one write.
> > if you don't use store-through ('nt' versions of mov),
> > then the write also implies a read for write-allocate
> > (filling the cache line).
> > without store-through, the peak theoretical number reported by
> > stream should be 3*peak/4. the 4 is because there are 3r+1w,
> > and the 3 because stream doesn't give credit for write-allocate.
>
> That looks right. So, one socket, with write allocate, >>should<< show:
>
>    10.5 GB/sec * .75 or 7.875 GBytes/sec
>
> and two sockets 15.75 GBytes/sec. The problem could be related
> to competitive/ineffective use of the shared L2 cache or a bottleneck
> in the North bridge. It would seem that a look at how the performance
> grows as you add cores within versus across sockets should reveal this.
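
The write-allocate arithmetic quoted above can be written out as a short Python sketch. This is only an illustration of that reasoning, not part of STREAM itself: the function name stream_triad_ceiling is made up here, and the 10.5 GB/s per-socket peak is the figure used in the quoted text.

```python
def stream_triad_ceiling(peak_gb_s, write_allocate=True):
    """Upper bound on the Triad rate STREAM would report, in GB/s.

    Triad does 2 reads + 1 write per element, and STREAM counts those
    3 streams. With write-allocate, the store also pulls the target
    cache line in first, so the memory bus carries 4 streams while
    STREAM still credits only 3.
    """
    counted = 3                         # streams STREAM gives credit for
    on_bus = 4 if write_allocate else 3  # streams actually moved
    return peak_gb_s * counted / on_bus


# Per-socket peak taken from the quoted discussion (10.5 GB/s).
one_socket = stream_triad_ceiling(10.5)
two_sockets = stream_triad_ceiling(2 * 10.5)

print(f"one socket : {one_socket:.3f} GB/s")
print(f"two sockets: {two_sockets:.3f} GB/s")
```

With non-temporal ('nt') stores the write-allocate read goes away, which corresponds to calling the function with write_allocate=False.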

Here you go (Dell 2950 with 8 modules, stream compiled with icc-9.1 -O3):

[EMAIL PROTECTED] streamd]# hostname ; date ; for i in 1 2 3 4 5 ; do export OMP_NUM_THREADS=$i ; ./streamd | egrep "Total memory re|Number of Th|Function |Copy:|Scale:|Add:|Triad:"; done
tbox3
Fri Aug 11 17:59:22 CEST 2006

Total memory required = 457.8 MB.
Number of Threads requested = 1
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        3945.5494       0.0812       0.0811       0.0813
Scale:       2914.9758       0.1098       0.1098       0.1099
Add:         3227.5618       0.1488       0.1487       0.1489
Triad:       3219.5307       0.1492       0.1491       0.1493

Total memory required = 457.8 MB.
Number of Threads requested = 2
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        4324.2058       0.0741       0.0740       0.0742
Scale:       2999.9626       0.1068       0.1067       0.1069
Add:         3309.2733       0.1451       0.1450       0.1452
Triad:       3309.7031       0.1451       0.1450       0.1452

Total memory required = 457.8 MB.
Number of Threads requested = 3
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        5422.5441       0.0590       0.0590       0.0590
Scale:       4102.8364       0.0780       0.0780       0.0781
Add:         4487.2464       0.1070       0.1070       0.1070
Triad:       4487.7465       0.1070       0.1070       0.1070

Total memory required = 457.8 MB.
Number of Threads requested = 4
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        6023.2969       0.0532       0.0531       0.0533
Scale:       4862.4855       0.0658       0.0658       0.0659
Add:         5264.1973       0.0912       0.0912       0.0913
Triad:       5268.1782       0.0911       0.0911       0.0911

Total memory required = 457.8 MB.
Number of Threads requested = 5
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:        5504.9004       0.0582       0.0581       0.0582
Scale:       4318.9044       0.0786       0.0741       0.1147
Add:         4705.1016       0.1042       0.1020       0.1216
Triad:       4705.2885       0.1038       0.1020       0.1184

> Two cores on separate sockets should show higher numbers if it's
> an L2 cache issue. If they are the same as those for 2 cores on one
> socket then you have a problem with the North bridge or getting
> full bandwidth from the FB-DIMMs.
>
> A complication in this test could be that in the one core per socket case
> the whole L2 cache is allocated to a single core. Watching performance
> change as the array sizes grow should reveal this.
>
> rbw
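
To put the Triad numbers above next to the 15.75 GB/s two-socket estimate from the quoted text, here is a rough Python sketch. The rates are copied from the output above. The variable names are made up for illustration, and treating 1 GB/s as 1000 MB/s is an assumption. The run above does not say which cores or sockets the threads landed on, so this shows only scaling with thread count, not the within-socket versus across-socket comparison.

```python
# Triad rates in MB/s, keyed by OMP_NUM_THREADS, copied from the run above.
triad_mb_s = {
    1: 3219.5307,
    2: 3309.7031,
    3: 4487.7465,
    4: 5268.1782,
    5: 4705.2885,
}

# Two-socket write-allocate ceiling from the quoted estimate (15.75 GB/s),
# converted assuming 1 GB/s = 1000 MB/s.
ceiling_mb_s = 15.75 * 1000

baseline = triad_mb_s[1]
speedup = {n: rate / baseline for n, rate in triad_mb_s.items()}
fraction = {n: rate / ceiling_mb_s for n, rate in triad_mb_s.items()}

print("threads  speedup  fraction of ceiling")
for n in sorted(triad_mb_s):
    print(f"{n:7d}  {speedup[n]:7.2f}  {fraction[n]:19.2f}")
```

Repeating this with threads pinned to chosen cores (both on one socket, then one per socket) would give the comparison rbw asks for.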
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf