Hi Bill, AMD should pay you for these wise comments ;) But since this list is about providing feedback and sharing knowledge, I would like to add something to your comments, somewhat HW agnostic. Running the STREAM benchmark is an easy way to find out what the memory controllers are capable of.
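For anyone who has not looked inside the benchmark, the idea is roughly the following; this is a minimal sketch in C, not the official STREAM source (which also runs copy/scale/add, repeats each kernel and validates the results), just to show how the GB/s number is obtained.

/* Minimal sketch of the idea behind STREAM's "triad" kernel: measure the
 * time to sweep once over arrays that are far larger than the caches, and
 * turn bytes moved over elapsed time into a bandwidth figure. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000L   /* ~160 MB per array: large enough to defeat the caches */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    const double scalar = 3.0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    /* Triad: 2 loads + 1 store per element.  The real benchmark runs this
     * across all cores (e.g. with OpenMP, as below) so you see the
     * aggregate bandwidth of all memory controllers, not just one. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;

    /* STREAM counts 3 arrays of 8-byte doubles touched once each; with
     * write-allocate caches the store actually costs extra read traffic,
     * which is where non-temporal stores come in (see further down). */
    printf("triad: %.2f GB/s\n", 3.0 * N * sizeof(double) / 1e9 / secs);

    free(a); free(b); free(c);
    return 0;
}

Compile with something like gcc -O2 -fopenmp; the official code adds repetitions, validation and the other three kernels.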
Coming down to how that is used in practice: it translates, for a wide variety of applications, into data-processing throughput, and therefore into real application performance, because data is stored in RAM, fetched into the caches, processed by the cores and returned to the caches, to finally be evicted back to RAM while new chunks of data are brought into cache, until the whole data set is processed. STREAM does minimal computation, at most the triad, but it really exposes the bottleneck (in negative terms) or the throughput (in positive terms) of the processor and the platform (when accounting for multiple processors connected by some type of fabric: cHT, QPI, network) when looking at the aggregated memory bandwidth.

The main comment I would like to add is with respect to your STREAM bandwidth results. Looking at your log2 chart, it says that AMD delivers about ~100 GB/s on a 4P system and Intel delivers ~30 GB/s on 2P systems. I may be reading the chart wrong, but it should be about 140 GB/s with AMD (Interlagos/Abu Dhabi) with 1600 MHz DDR3 memory, about 40 GB/s with Intel (Nehalem/Westmere) with 1333 MHz DDR3 memory, and about 75 GB/s with Sandy Bridge with 1600 MHz DDR3 memory.

To achieve such significantly higher memory bandwidth on this specific benchmark, the point I want people to realize is that the data is used only once. There is a loop to repeat the experiment and average the timings, but in terms of processing, the data is used once and then you bring in a new chunk; in other words, there is no reuse of the data in the near term. Therefore you want to speed up the processing by getting rid of the data already processed, evicting it from the cache levels closest to the core directly into RAM and bringing fresh data from RAM into the caches, rather than writing the recently processed data back through the caches and wasting precious space on data you don't need for the time being. If you bypass the normal mechanism, you increase the amount of new data fetched into the caches while quickly storing the crunched data into RAM. To do so, you use non-temporal stores, which write results to RAM bypassing the normal write-allocate path through the caches (see the short sketch further down). Many applications behave that way: you do a pass through the data and you may access it again (e.g., in the next iteration), but only after you have processed a bunch more data (e.g., in the current iteration), which prevents the cache from keeping that data close. Better to get rid of it and bring it back when needed. If you do so on applications that are not cache friendly (cache friendly being the opposite of the access pattern I just described), you will greatly improve their performance.

Finally, I have made a chart of performance per dollar for a wide range of processor variants, taking as performance both FLOP/s and memory bandwidth, assuming equal cost for the chassis and the amount of memory, and dividing the performance by the cost of the processor. I am attaching it to this email. I took the processor prices from publicly available information for both AMD and Intel processors. I know that prices vary from deal to deal, but as fair an estimate as I can make, I get that Perf/$ is 2x on AMD versus Intel, regardless of whether you look at FLOP/s or GB/s, comparing similar processor models (i.e., 8-core Intel vs. 16-core AMD). You can build the chart yourself if you know how to compute real FLOP/s and real bandwidth (there is a back-of-envelope sketch further down). I also did the amusing exercise of halving the price of the Intel processors (e.g., a 50% discount), and then the Intel Perf/USD lines moved to match the AMD lines, i.e., Intel becomes Perf/USD competitive, or on par, without AMD having to discount at all.
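Here is what the non-temporal store version of the triad loop above could look like. This is only a sketch with SSE2 intrinsics (which both the AMD and Intel parts discussed here support), not necessarily how STREAM itself is built; compilers can also generate streaming stores automatically in some cases, so you rarely have to write them by hand.

/* Triad with non-temporal (streaming) stores: the freshly computed results
 * bypass the cache hierarchy on the write path and go straight to RAM,
 * instead of evicting data you might still want.
 * Assumptions: n is even and the arrays are 16-byte aligned (e.g. from
 * posix_memalign).  Compile with e.g. gcc -O2 -msse2. */
#include <emmintrin.h>   /* SSE2: __m128d, _mm_load_pd, _mm_stream_pd, ... */
#include <stddef.h>

static void triad_nt(double *restrict a, const double *restrict b,
                     const double *restrict c, double scalar, size_t n)
{
    const __m128d s = _mm_set1_pd(scalar);
    for (size_t i = 0; i < n; i += 2) {
        __m128d vb = _mm_load_pd(&b[i]);
        __m128d vc = _mm_load_pd(&c[i]);
        /* a[i] = b[i] + scalar*c[i], but the store does not allocate a
         * cache line; it is write-combined and sent directly to memory. */
        _mm_stream_pd(&a[i], _mm_add_pd(vb, _mm_mul_pd(s, vc)));
    }
    _mm_sfence();   /* make the write-combined stores globally visible */
}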
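And for the Perf/$ exercise, the arithmetic is simple enough to put in a few lines. The configurations and prices below are placeholders purely to show the calculation (roughly in the spirit of a 16-core AMD part and an 8-core Intel part of that generation); they are not the figures behind my chart, and you should plug in your own quoted prices and, ideally, measured rather than peak numbers.

/* Back-of-envelope Perf/$ sketch: peak FLOP/s from
 * sockets * cores * DP FLOPs/cycle * clock, peak GB/s from
 * sockets * memory channels * DDR transfer rate * 8 bytes,
 * each divided by the total processor cost. */
#include <stdio.h>

struct cpu {
    const char *name;
    int sockets, cores_per_socket, flops_per_cycle;  /* DP FLOPs/core/cycle */
    double ghz;
    int mem_channels;        /* per socket */
    double mem_mts;          /* DDR3 transfer rate in MT/s */
    double price_usd;        /* per CPU, illustrative only */
};

static void perf_per_dollar(const struct cpu *p)
{
    double gflops = p->sockets * p->cores_per_socket *
                    p->flops_per_cycle * p->ghz;
    double gbs    = p->sockets * p->mem_channels * p->mem_mts * 8.0 / 1000.0;
    double cost   = p->sockets * p->price_usd;
    printf("%-12s %7.1f GFLOP/s  %6.1f GB/s  %.3f GFLOP/s/$  %.4f GB/s/$\n",
           p->name, gflops, gbs, gflops / cost, gbs / cost);
}

int main(void)
{
    /* Placeholder configurations and prices, not quotes. */
    struct cpu amd_4p   = {"4P 16c AMD",   4, 16, 4, 2.3, 4, 1600.0,  800.0};
    struct cpu intel_2p = {"2P 8c Intel",  2,  8, 8, 2.6, 4, 1600.0, 1550.0};
    perf_per_dollar(&amd_4p);
    perf_per_dollar(&intel_2p);
    return 0;
}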
Best regards,
Joshua Mora