On 01/11/2013 04:01 AM, Joshua mora acosta wrote:
> Hi Bill,
> AMD should pay you for these wise comments ;)
>
> But since this list is about providing feedback, and sharing knowledge, I
> would like to add something to your comments, and somewhat HW agnostic. When
> you are running stream benchmark it is an easy way to find out what the memory
> controllers are capable.

Well it's my own code, last I checked stream didn't do dynamic allocations or
use pthreads. Not to mention various tweaks for NUMA, affinity, and related.

> Stream does minimal computation, at most the triad but it really exposes the
> bottleneck (in negative terms) or the throughput (in positive terms) of the
> processor and platform (when accounting multiple processors connected by some
> type of fabric: cHT, QPI, network) when looking at the aggregated memory
> bandwidth.

Correct, stream is a lousy benchmark to quantify application performance. Just
wanted to counter some comments I've heard about AMD's memory system.

> The main comment I would like to add is with respect to your stream bandwidth
> results. Looking at your log2 chart, it says that AMD delivers about ~100GB/s
> on 4P system and on Intel it delivers ~30GB/s on 2P systems. I may be reading
> wrong in the chart but it should be about 140GB/s with AMD
> (Interlagos/Abudhabi) with 1600MHz DDR3 memory and about 40GB/s with INTEL
> (Nehalem/Westmere) with memory at 1333MHz DDR3 and about 75GB/s with
> Sandybridge with memory at 1600MHz DDR3.

Well in my experience there are 3 major numbers for sequential memory
bandwidth:
1) The marketing numbers (clockspeed * width), which is approximately 50GB/s
   per socket for Intel/AMD with 4 channels.
2) Stream returned numbers using good compilers (intel, portland group, or
   open64) that only work with static arrays. Often 50-75% or so of the
   marketing numbers.
3) Stream returned numbers using good compilers using dynamic allocation
   (malloc in c or new in c++), often 25-50% of the marketing numbers.

From what I can tell the use of dynamic allocation disables non-temporal
stores. Gcc usually matches dynamic allocation numbers (#3) with or without
dynamic allocation.

I wonder what percentage of bandwidth intensive codes dynamically allocate
memory.

> In order to do so, you want to use non temporal stores, which bypass the
> regular process of cache coherence.
> Many applications behave that way since
> you have to do a pass through the data and you may access it again (eg. in the

I believe Intel, Portland Group, and Open64 automatically do this, even when
just doing the obvious:

    for (j=0; j<N; j++)    // where N = large array
        c[j] = a[j]+b[j];

Sadly if a, b, or c were dynamically allocated that seems to disable the non
temporal stores. For instance, open64, openmp, 1831MB array:

Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       101336.5507       0.0135       0.0126       0.0146
Scale:       98265.0155       0.0141       0.0130       0.0153
Add:        103543.0881       0.0202       0.0185       0.0225
Triad:      104677.6852       0.0194       0.0183       0.0213

If I switch to using malloc:

97,99c97
< static double a[N+OFFSET],
<               b[N+OFFSET],
<               c[N+OFFSET];
---
> static double *a,*b,*c;
134a133,135
>     a = (double *) malloc ((N+OFFSET)*sizeof(double));
>     b = (double *) malloc ((N+OFFSET)*sizeof(double));
>     c = (double *) malloc ((N+OFFSET)*sizeof(double));

Copy:        74228.2843       0.0178       0.0172       0.0185
Scale:       74310.4782       0.0180       0.0172       0.0189
Add:         82776.3594       0.0240       0.0232       0.0249
Triad:       82598.0664       0.0239       0.0232       0.0250

> Finally, I have done a chart of performance/dollar for a wide range of
> processor variants, taking as performance both FLOPs and memory bandwidth and
> assuming equal cost of chassis and amount of memory, dividing the performance
> by the cost of the processor.

I agree that the costs of chassis, ram, motherboard and related are very
similar. But it seems odd to evaluate price/performance without using the
system (not CPU) price. The best price/perf CPU will very often be different
than the CPU for the best price/perf node. While interesting, when making
design/purchase decisions I look at price/performance per node.

> I am attaching it to this email. I took the cost of the processors from
> publicly available information on both AMD and INTEL processors.
> I know that
> price varies for each deal but as a fair as possible estimate, I get that
> Perf/$ is 2X on AMD than on INTEL, regardless of looking at FLOP/s or GB/s,
> and comparing similar processor models (ie. 8c INTEL vs 16c AMD).

Did you intentionally ignore the current generation AMDs? Personally I'd find
a CPU2006 per $ more interesting (Int or FP rate).

> You can make the chart by yourself if you know how to compute real FLOPs and
> real bandwidth.

Normally I take wall clock time on an application justifying the purchase of a
cluster / cost of node.

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf