On Nov 19, 2012, at 7:23 PM, Robert G. Brown wrote:

> On Mon, 19 Nov 2012, Vincent Diepeveen wrote:
>
>>
>> On Nov 19, 2012, at 6:12 PM, Robert G. Brown wrote:
>>
>>> On Mon, 19 Nov 2012, Vincent Diepeveen wrote:
>>>
>>>> If you measure memory latency at all 8 cores at the same time, it's
>>>> even more horrible.
>>>
>>> Thanks for a remarkably clear and useful reply, Vincent. This nearly
>>> precisely mirrors my own measurements with a more floating point
>>> intensive task. The larger i7-3770 cache and its 8 operational
>>> contexts (it is a four core system but it maintains two completely
>>> independent contexts per core, IIRC) seem to give it an overwhelming
>>> advantage over the FX with its eight "real" cores but much smaller
>>> cache. Interesting to see that this continues with the (I assume)
>>> integer/logic intensive chess code.
>>
>> Maybe you meant saying it correctly but wrote it wrong.
>>
>> The FX8150 has a huge SLOW L2 cache of 1MB or so (2MB a module) and
>> the i7's all have a small FAST L2 cache of around 256KB.
>
> Um, from cat /proc/cpuinfo:
>
> processor       : 7
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 58
> model name      : Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
> stepping        : 9
> cpu MHz         : 1600.000
> cache size      : 8192 KB
> ...
That's the L3 cache. The L3 cache is just SRAM so to speak. What I
referred to is the big difference in L2 cache.

Here is what you are looking for:
http://en.wikipedia.org/wiki/Ivy_Bridge_%28microarchitecture%29

CPUID code    : 0306A9h
Product code  : 80637 (desktop)
L1 cache      : 64 kB per core
L2 cache      : 256 kB per core
L3 cache      : 3 MB to 8 MB shared

The AMD FX-8150 on the other hand has a whopping 1MB of L2 per core.
The L3 cache is 8MB as well.

Though I probably ask for the fury of hardware engineers if I say that
between the i7 and the Bulldozer there is no big difference in the L1
and the L3, there is a huge difference in the L2.

Both CPUs decode 4 instructions a clock by the way. AMD for 1 module,
Intel for 1 core.

The real big difference is therefore the latencies of the caches and
RAM and the speed (or slowness) of the execution units.

AMD had a new design trick up their sleeves btw to do the decoding
'better': splitting it from 1 bundle of 4 instructions to 2 bundles of
2 instructions each clock. That should give a speedup of a few percent
for parallel workloads...

Yet you compare with quadcore Intels, which is big nonsense IMHO. More
interesting are the sixcore and 8 core Intels if you ask me.

>
> Note well that 8192 KB is 8 MB as of last time I looked. Although my
> FX box is turned off at the moment because it is loud in ADDITION to
> being hot, I recall it had a 2 MB cache total. I'll boot it later
> today and look again, but I think I reported all of this on list two
> weeks ago, with graphs.
>
> I don't know if these cache sizes are per core or per system, but note
> that I'm up to "processor 7" on "core 3" according to other lines. The
> kernel, at least, thinks that the Intel has way more L2 than the AMD.
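The single "cache size" line in /proc/cpuinfo does not say which cache level it describes, so one way to settle the per-core versus per-system question on Linux is to read the per-level entries under sysfs instead. What follows is only a minimal sketch: `read_cache_levels` is a hypothetical helper name, and it assumes a Linux kernel that exposes `/sys/devices/system/cpu/cpuN/cache/indexM/`; on any other system it simply returns an empty list.

```python
from pathlib import Path


def read_cache_levels(cpu_dir="/sys/devices/system/cpu/cpu0"):
    """Return one dict per cache the kernel lists for this CPU ([] if none)."""
    caches = []
    # Each indexM directory describes one cache (L1d, L1i, L2, L3, ...).
    for index_dir in sorted(Path(cpu_dir, "cache").glob("index*")):
        entry = {}
        for field in ("level", "type", "size", "shared_cpu_list"):
            attr = index_dir / field
            # Assumption: some kernels or VMs omit attributes; mark them "?".
            entry[field] = attr.read_text().strip() if attr.is_file() else "?"
        caches.append(entry)
    return caches


if __name__ == "__main__":
    for c in read_cache_levels():
        print(f"L{c['level']} {c['type']:<12} {c['size']:>8}  "
              f"shared by CPUs {c['shared_cpu_list']}")
```

The `shared_cpu_list` column is the useful part here: it shows whether a given cache is private to one logical CPU, shared by the hardware threads of one core (or the two cores of a module), or shared by the whole chip.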
> Since the only way I can possibly interpret my job getting linear
> speedup on four cores through eight tasks is for the job to more or
> less be executing out of cache so that it was just doing hardware
> context switches into separate unblocked ALUs or the like, I sort of
> believe it.
>
>>
>> If we measure accurately then the FX8150 gets a huge speedup from the
>> SMT. So moving from 4 cores to 8 cores it benefits really a lot.
>> Exactly what you would expect with a slow L2 cache.
>>
>> The i7 on the other hand hardly profits from Hyperthreading. In
>> general the higher you clock (or overclock) the i7 the more it
>> profits, yet we still speak about a small percentage: 20% at lower
>> clock up to 30% at high clock for Diep.
>>
>> For most number crunching floating point code here (prime numbers)
>> the speedup from hyperthreading is more around 5%, so it hardly
>> benefits there.
>>
>> At the more modern i7's the multiplication unit has been sped up.
>> So it can deliver a much bigger throughput there.
>>
>> This whereas the FX8150 has been slowed down by a factor of 2.
>>
>>>
>>> Basically, the i7 looks like a butt-kicking good processor, with the
>>> one problem being that it doesn't look like a multiprocessing cpu
>>> (at least I can't find a dual i7 motherboard, although in principle
>>> it appears to be possible), leaving one with Xeons that don't LOOK
>>> like they would perform as well although I'd be interested in
>>> information on that as well.
>>
>> The i7-3770k is the latest i7 and it's Ivy Bridge.
>> It's really low power though, just around 50 watts.
>>
>> The Xeons are all older generation i7, a Sandy Bridge. They eat lots
>> of power, yet performance is very good.
>>
>> Intel wants to cash in on them, AMD really messed up in that market
>> segment.
>>
>> For most servers in the server market, not to be confused with HPC,
>> power consumption does matter and Intel is winning the battle there.
>>
>>>
>>> At the moment, single processor i7's look like they might actually
>>> be the world's fastest, at least on a per core basis. OTOH, it might
>>> well be that putting two of them on a single board would horribly
>>> saturate the memory bus and cause memory management collisions and
>>> worse and cost them their advantage.
>>
>> In itself AMD's coherency protocol is in some areas superior to
>> Intel's.
>>
>> Intel has been struggling there for a good number of years, which is
>> especially visible in the 4 socket domain, not to mention 8 sockets.
>>
>> Note that newer Xeons have a few features which AMD doesn't have,
>> which in some software might kick butt. That's synchronisation within
>> the L3s, whereas AMD goes via the RAM.
>>
>> I'm not into patents, yet it's possible one reason for the success is
>> that AMD took over DEC Alpha's master slave concept. I'm not sure
>> whether Intel's problem was to get around those patents.
>>
>> In either case, latency to the RAM always was faster on Intel than on
>> AMD, except for when Intel still had the memory controller off die
>> and Opteron was released.
>>
>> AMD then quickly got 50% market share in the server market with
>> Opteron for a short while.
>>
>> I wrote a test program to measure latency to the RAM doing just
>> random reads of 8 bytes into a big buffer, with all cores at the same
>> time.
>>
>> From memory I remember these numbers:
>>
>> i7 single chip       : 60 - 70 ns
>> dual i7 Xeon 3.4GHz  : 90 ns
>> Phenom DDR3          : 100 ns
>> FX8150               : 160+ ns (thanks to Joel Hruska for benchmarking)
>>
>> So AMD's design idea now to design a chip with a latency even worse
>> than their previous generation Phenom core is not explainable for the
>> server market. They did do well the previous time, when latency to
>> the RAM was BETTER than Intel's. So making it worse there is a weird
>> decision.
>>
>> This is not just the architects' fault.
>> This is something so important to a company like AMD, the CEO must
>> be involved in such decisions.
>>
>> In all server loads this latency issue of the Bulldozer is a BIG
>> reason why it is so slow.
>>
>> Both the L2 latency as well as the RAM latency.
>>
>> Please note if you measure single core to 4 cores the latency on
>> Bulldozer is a lot better. It slows down really a lot when putting
>> all cores under load.
>>
>>>
>>> I'm getting ready to do some very data intensive stuff --
>>> terabyte-scale datasets being chewed to pieces basically -- to the
>>> point where my "cluster" will probably be a pile of RAIDs each with
>>> its own private copy of the datasets in question and equipped with
>>> an i7 motherboard, which seems odd somehow (as the i7 motherboards
>>> aren't generally configured as "server" motherboards) but the Xeons
>>> all run at lower clock and are older technology.
>>>
>>> Comments from anyone else?
>>>
>>
>> Cheapskate clusters with low clocked cpu's are totally unbeatable
>> pricewise.
>>
>> I don't know whether you can use AVX. If not, did you consider buying
>> for $150 a bunch of nodes with 2 socket Xeon L5420 or something
>> with 8 GB ram?
>>
>> For the price of a single i7 system you can get 3 to 4 of them.
>>
>> Another idea is using a 48 core AMD system. Though on ebay the cpu's
>> are a tad more expensive now, if you buy 4 of the 6180SE and a
>> motherboard, you have 48 cores, huge RAM and 4 memory controllers and
>> 6 memory channels a socket.
>>
>> A total of 24 memory channels or so (if I did my math ok).
>>
>> Until recently these 6180SE cpu's were $450 on ebay, though I see
>> them now for $650 or so.
>>
>> If your workload parallelizes well it could be an idea. They do not
>> have AVX however.
>>
>> For what you are gonna do maybe your biggest pal is ebay, regardless
>> of what you want to order.
>>
>>
>>> rgb
>>>
>>>>
>>>>> I would have hoped that AMD would dig in and innovate and
>>>>> regain at least parity if not the lead, because it is good for the
>>>>> industry for Intel to have serious competition, but while Intel
>>>>> could make money and survive as second best to AMD, AMD can't make
>>>>> any money as second best to Intel...
>>>>
>>>> We must split of course the 2 worlds of HPC performance.
>>>> In fact there are 3 but let's do a rough 2 world division:
>>>>
>>>> a) floating point or vectorized performance (can be integers as
>>>> well)
>>>>
>>>> We skip A : the manycores have won there.
>>>>
>>>> b) integer performance non-vectorized
>>>>
>>>> For integers and branches I take a huge program like Diep.
>>>>
>>>> http://www.lostcircuits.com/mambo//index.php?option=com_content&task=view&id=105&Itemid=42&limit=1&limitstart=13
>>>>
>>>> More is better.
>>>>
>>>> i7-3960X-EE         : 2.0  million chess positions a second (12 logical cores)
>>>> i7-980x turbo       : 1.85 million chess positions a second (12 logical cores)
>>>> i7-3770k            : 1.47 million chess positions a second (8 logical cores)
>>>> AMD Phenom X6 1100T : 1.34 million chess positions a second (6 cores)
>>>> AMD Phenom X6 1090T : 1.30 million chess positions a second (6 cores)
>>>> FX-8150             : 1.22 million chess positions a second (8 mini cores)
>>>>
>>>> The FX-8150 is AMD's latest 'bulldozer' CPU.
>>>>
>>>> The problem is the new generation FX-8150 at a NEW process
>>>> technology, with 2 billion transistors or so (caches counted
>>>> - the initial press release from AMD - not the later one where
>>>> they, by creatively not counting things, reached 1.2 billion) is
>>>> not beating their own old design.
>>>>
>>>> Furthermore another big problem is power usage.
>>>>
>>>> http://www.lostcircuits.com/mambo//index.php?option=com_content&task=view&id=105&Itemid=42&limit=1&limitstart=6
>>>>
>>>> Under full load:
>>>>
>>>> Phenom X6 1090T : 69.6 watt
>>>> Phenom X6 1100T : 92 watt
>>>>
>>>> We see how the 1100T already was clocked a tad too high by AMD,
>>>> which explains the huge power increase.
>>>>
>>>> Now the FX-8150 : 115.2 watt
>>>>
>>>> As if Moore's Law guaranteeing progress doesn't exist...
>>>>
>>>> As for you, in many benchmarks you did maybe multiplication was
>>>> important. Each minicore has its own multiplication unit.
>>>> Sounds good huh?
>>>>
>>>> So far the good news. The problem is: that unit is also over 2
>>>> times slower...
>>>>
>>>> Please note that Bulldozer does have AVX. From benchmarks we know
>>>> that both Intel as well as AMD with this Bulldozer had tried to
>>>> optimize performance for games. Games using AVX especially.
>>>>
>>>> It's not doing bad there in fact. Worse than the quadcore Intels. I
>>>> don't want a quadcore chip though.
>>>> I want a million cores.
>>>>
>>>>>
>>>>> rgb
>>>>>
>>>>>>
>>>>>> --
>>>>>> Doug
>>>>>>
>>>>>> --
>>>>>> Mailscanner: Clean
>>>>>>
>>>>>> _______________________________________________
>>>>>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin
>>>>>> Computing
>>>>>> To change your subscription (digest mode or unsubscribe) visit
>>>>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>>>>
>>>>>
>>>>> Robert G. Brown    http://www.phy.duke.edu/~rgb/
>>>>> Duke University Dept. of Physics, Box 90305
>>>>> Durham, N.C.
>>>>> 27708-0305
>>>>> Phone: 1-919-660-2567  Fax: 919-660-2525
>>>>> email:r...@phy.duke.edu
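
To put the "not beating their own old design" point from the thread into one number, the Diep throughput and full-load power figures quoted above can be combined into positions per joule. This is only a rough sketch: the figures are copied from the thread, the names `CHIPS`, `positions_per_joule` and `efficiency_table` are made up for illustration, and it assumes the power readings were taken under a load comparable to the chess benchmark, which the thread does not state.

```python
# Figures copied from the thread: Diep throughput (chess positions/second)
# and full-load power draw (watts) for the three AMD chips that have both.
# Assumption: both numbers refer to comparable load conditions.
CHIPS = {
    "Phenom X6 1090T": (1.30e6, 69.6),
    "Phenom X6 1100T": (1.34e6, 92.0),
    "FX-8150": (1.22e6, 115.2),
}


def positions_per_joule(positions_per_second, watts):
    """Throughput over power: (positions/s) / (J/s) = positions per joule."""
    return positions_per_second / watts


def efficiency_table(chips=CHIPS):
    """Map each chip name to its positions-per-joule figure."""
    return {name: positions_per_joule(nps, w) for name, (nps, w) in chips.items()}


if __name__ == "__main__":
    table = efficiency_table()
    baseline = table["Phenom X6 1090T"]
    for name, eff in table.items():
        print(f"{name:<16} {eff:8.0f} positions/J  "
              f"({eff / baseline:.2f}x of 1090T)")
```

With these inputs the 1090T comes out at roughly 18,700 positions per joule, the 1100T at roughly 14,600 and the FX-8150 at roughly 10,600, so on this workload the newer chip does a little over half the work per joule of the older one.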