Mark Hahn wrote:
> (Well, duh).
>
> yeah - the point seems to be that we (still) need to scale memory
> along with core count.  not just memory bandwidth but also concurrency
> (number of banks), though "ieee spectrum online for tech insiders"
> doesn't get into that kind of depth :(

I think this needs to be elaborated a little for those who don't know the layout of SDRAM ...

A typical chip used in a 4 GB DIMM would be a 2 Gbit SDRAM chip, of which there would be 16 (16 x 2 Gbit = 32 Gbit = 4 GB). Each chip contributes 8 bits towards the 64-bit DIMM interface, so there are two "ranks", each made up of 8 chips. Each rank operates independently of the other, but the two share (and are limited by) the bandwidth of the memory channel. From here on I'm going to use the Micron MT47H128M16 as the SDRAM chip, because I have the datasheet, though other chips are probably very similar.
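
Just to sanity-check those numbers, here's a little C sketch of the geometry. The constants simply restate the figures above (2 Gbit chips in an x8 organization, 16 of them on a 64-bit DIMM); nothing here is pulled from the datasheet beyond that.

#include <stdio.h>

int main(void)
{
    /* Figures restated from the text: 2 Gbit chips, x8 organization,
     * 16 chips on the DIMM, 64-bit data bus. */
    const int chip_width_bits = 8;
    const long long chip_mbit = 2048;
    const int dimm_width_bits = 64;
    const int chips_on_dimm   = 16;

    int chips_per_rank   = dimm_width_bits / chip_width_bits;  /* 8 */
    int ranks            = chips_on_dimm / chips_per_rank;     /* 2 */
    long long dimm_mbyte = chips_on_dimm * chip_mbit / 8;      /* 4096 MB */

    printf("%d chips per rank, %d ranks, %lld MB per DIMM\n",
           chips_per_rank, ranks, dimm_mbyte);
    return 0;
}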

Internally, each SDRAM chip is made up of 8 banks of 32 K * 8 Kbit memory arrays. Each bank can be controlled separately but shares the DIMM bandwidth, much as each rank does. Before a particular memory cell can be accessed, the whole 8 Kbit "row" containing it needs to be activated. Only one row can be active per bank at any point in time. Once the memory controller is done with a particular row, the row needs to be "precharged", which basically amounts to writing it back into the main array. Activating and precharging are relatively expensive operations: at top speed (DDR2-1066) on the Micron chips mentioned, precharging one row takes at least 11 cycles (tRTP + tRP) and activating another takes at least 7 cycles (tRCD), during which no data can be read from or written to the bank. Precharging takes another 4 cycles if you've just written to the bank.
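
To put rough numbers on it, here's a minimal sketch that just adds up the cycle counts above (treat them as illustrative rather than datasheet-exact), plus a sanity check that 8 banks of 32 K * 8 Kbit really is 2 Gbit.

#include <stdio.h>

int main(void)
{
    const int t_rtp_plus_rp = 11; /* read-to-precharge + precharge      */
    const int t_rcd         = 7;  /* activate-to-first-read/write       */
    const int t_write_extra = 4;  /* extra precharge cost after a write */

    /* Sanity check on the array size: 8 banks x 32 K rows x 8 Kbit rows. */
    long long chip_bits = 8LL * 32 * 1024 * 8 * 1024;
    printf("chip capacity: %lld Mbit\n", chip_bits / (1024 * 1024));

    printf("row change after a read:  %d dead cycles\n", t_rtp_plus_rp + t_rcd);
    printf("row change after a write: %d dead cycles\n",
           t_rtp_plus_rp + t_rcd + t_write_extra);
    return 0;
}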

The second thing to know is that processors operate in cacheline-sized blocks. Current x86 cache lines are 64 bytes, IIRC. In a dual-channel system with channel interleaving, odd-numbered cachelines come from one channel and even-numbered cachelines from the other. So each cacheline fill reads 8 bytes per chip (which fits in nicely with the standard burst length of 8, since each beat from the chip is 8 bits), which comes out to 128 cachelines per row. Like channel interleaving, bank interleaving is also used. So:
- Cacheline 0 comes from channel 0, bank 0
- Cacheline 1 comes from channel 1, bank 0
- Cacheline 2 comes from channel 0, bank 1
- Cacheline 3 comes from channel 1, bank 1
  ...
- Cacheline 14 comes from channel 0, bank 7
- Cacheline 15 comes from channel 1, bank 7
So this pattern repeats every 1 KB, and every 128 KB a new row needs to be opened on each bank. IIRC, rank interleaving is done on AMD quad-core processors, but not on the older dual-core processors or on Intel's discrete northbridges. I'm not sure about Nehalem.
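
If it helps, here's a toy decode of that mapping in C. The field boundaries are my own simplification (64 B lines, 2 channels, 8 banks, 128 lines per row per bank, no rank interleaving); real controllers are free to hash these bits differently.

#include <stdio.h>
#include <stdint.h>

struct dram_loc { unsigned channel, bank; uint64_t row; };

/* Toy decode: channel interleave first, then 8-way bank interleave,
 * 128 cachelines per row per bank. */
static struct dram_loc decode_interleaved(uint64_t addr)
{
    struct dram_loc loc;
    uint64_t line = addr / 64;                 /* 64 B cacheline index  */
    loc.channel = (unsigned)(line % 2);        /* alternate channels    */
    loc.bank    = (unsigned)((line / 2) % 8);  /* then alternate banks  */
    loc.row     = line / (2 * 8) / 128;        /* new row every 128 KB  */
    return loc;
}

int main(void)
{
    uint64_t lines[] = { 0, 1, 2, 3, 14, 15, 16, 2047, 2048 };
    for (unsigned i = 0; i < sizeof lines / sizeof lines[0]; i++) {
        struct dram_loc l = decode_interleaved(lines[i] * 64);
        printf("cacheline %4llu -> channel %u, bank %u, row %llu\n",
               (unsigned long long)lines[i], l.channel, l.bank,
               (unsigned long long)l.row);
    }
    return 0;
}

Cacheline 16 lands back on channel 0, bank 0 (the 1 KB repeat), and cacheline 2048 is the first one to open row 1 (the 128 KB boundary).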

This is all fine and dandy on a single-core system. Bank interleaving keeps the channel busy by using another bank while one bank is being activated or precharged, and with a good prefetcher you can hit close to 100% utilization of the channel. However, it can cause problems on a multi-core system. Say you have two cores, each scanning through a separate 1 MB block of memory. Each core is demanding a different row from the same bank, so the memory controller has to keep changing rows. This may not appear to be an issue at first glance - after all, we have 128 cycles between each CPU hitting a particular bank (8 bursts * 8 cycles per burst * 2 processors sharing bandwidth), so we've got 64 cycles between row changes. That's over twice what we need (unless we're using 1 GB or smaller DIMMs, which only have 4 banks, so things become tight).

The killer, though, is latency - instead of a 4-ish cycle CAS delay per read, we're now looking at 22 cycles for a precharge + activate + CAS. In a streaming situation this doesn't hurt too much, as a good prefetcher will already be indicating that it needs the next cacheline. But if your access patterns aren't particularly prefetcher-friendly, you're going to suffer.
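
In numbers (again using the illustrative cycle counts from above, with a 4-ish cycle CAS):

#include <stdio.h>

int main(void)
{
    const int t_cas       = 4;   /* CAS delay when the row is already open */
    const int t_precharge = 11;  /* tRTP + tRP                             */
    const int t_activate  = 7;   /* tRCD                                   */

    int open_row_hit = t_cas;                            /*  ~4 cycles */
    int row_conflict = t_precharge + t_activate + t_cas; /* ~22 cycles */

    printf("open-row hit: %d cycles, row conflict: %d cycles (%.1fx)\n",
           open_row_hit, row_conflict, (double)row_conflict / open_row_hit);
    return 0;
}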

Simply cranking up the number of banks doesn't help this. You've still got thrashing, you're just thrashing more banks. Turning up the cacheline size can help, as you transfer more data per stall. The extreme solution is to turn off bank interleaving. Our memory layout now looks like:
- Cacheline 0 comes from channel 0, bank 0, row 0, offset 0 bits
- Cacheline 1 comes from channel 1, bank 0, row 0, offset 0 bits
- Cacheline 2 comes from channel 0, bank 0, row 0, offset 64 bits
- Cacheline 3 comes from channel 1, bank 0, row 0, offset 64 bits
  ...
- Cacheline 254 comes from channel 0, bank 0, row 0, offset 8 K - 64 bits
- Cacheline 255 comes from channel 1, bank 0, row 0, offset 8 K - 64 bits
- Cacheline 256 comes from channel 0, bank 0, row 1, offset 0 bits
- Cacheline 257 comes from channel 1, bank 0, row 1, offset 0 bits
So a new row every 16 KB, and a new bank every 512 MB (and a new rank every 4 GB).
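
Here's the same toy decode with bank interleaving switched off (channel interleaving kept). Again the boundaries are just my simplification; the offset comes out in bytes across the rank, which is numerically the same as the per-chip bit offsets in the list above.

#include <stdio.h>
#include <stdint.h>

struct dram_loc { unsigned channel, bank, offset; uint64_t row; };

/* Toy decode with bank interleaving off: consecutive lines walk through a
 * whole row, then the next row, and stay in one bank for 512 MB. */
static struct dram_loc decode_no_bank_interleave(uint64_t addr)
{
    struct dram_loc loc;
    uint64_t line     = addr / 64;            /* cacheline index            */
    uint64_t per_chan = line / 2;             /* line index within channel  */
    loc.channel = (unsigned)(line % 2);
    loc.offset  = (unsigned)(per_chan % 128) * 64;        /* within the row */
    loc.row     = (per_chan / 128) % 32768;               /* 32 K rows/bank */
    loc.bank    = (unsigned)(per_chan / 128 / 32768) % 8;
    return loc;
}

int main(void)
{
    uint64_t samples[] = { 0, 64, 128, 16 * 1024, 512ULL << 20 };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        struct dram_loc l = decode_no_bank_interleave(samples[i]);
        printf("addr %10llu -> channel %u, bank %u, row %llu, offset %u\n",
               (unsigned long long)samples[i], l.channel, l.bank,
               (unsigned long long)l.row, l.offset);
    }
    return 0;
}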

For a single core this generally doesn't have a big effect, since the 18-cycle precharge+activate delay can often be hidden by a good prefetcher, and in any case it only comes around every 16 KB (as opposed to every 128 KB with bank interleaving - so row changes are a bit more frequent, though for large memory blocks it's a wash). However, this is a big killer for multicore: if you have two cores walking through the same 512 MB area, they'll be thrashing the same bank. Not only does latency suffer, but bandwidth does as well, since the other 7 banks can't be used to cover up the wasted time. Every 8 cycles of reading will require about 18 cycles of sitting around waiting for the bank, dropping bandwidth by about 70%.
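
The ~70% figure is just the ratio of dead time to total time:

#include <stdio.h>

int main(void)
{
    const double data_cycles    = 8.0;   /* data transfer per cacheline      */
    const double penalty_cycles = 18.0;  /* precharge + activate, not hidden */

    double utilization = data_cycles / (data_cycles + penalty_cycles);
    printf("channel utilization while thrashing: %.0f%% (~%.0f%% drop)\n",
           100.0 * utilization, 100.0 * (1.0 - utilization));
    return 0;
}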

However, with proper OS support this can be a bit of a win. By associating banks (512 MB memory blocks) with cores in the standard NUMA way, each core can be operating out of its own bank. There's no bank thrashing at all, which allows much looser requirements on activation and precharge, which in turn can allow higher speeds. With channel interleaving we can have up to 8 cores/threads operating this way; with independent channels (a la Barcelona) we can do 16.

Of course, this isn't ideal either. A row change will stall the associated CPU and can't be hidden, so ideally we want at least 2 banks per CPU, interleaved. Shared memory will also be hurt under this scheme (in both bandwidth and latency), since it will experience bank thrashing and will only have 2 banks to work with. To cover the activate and precharge times we need at least 4 banks per core, so for a quad-core CPU we need a total of 16 memory banks in the system, partly interleaved; 8 banks per core can improve performance further with certain access patterns. Also, to keep good single-core performance we'll need to use both channels. In that case, 4-way bank interleaving per channel (two sets of 4-way interleaves), with channel interleaving and no rank interleaving, would work - though again, 8-way bank interleaving would be better if there's enough to go around.
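
There's no OS interface for any of this today (more on that below), but the placement policy itself is trivial to state. A purely hypothetical sketch, with made-up names, using contiguous 512 MB bank regions and two slices per core:

#include <stdio.h>
#include <stdint.h>

#define BANK_BYTES     (512ULL << 20) /* one interleaved bank region         */
#define BANKS_PER_CORE 2              /* >= 2 so row changes can be covered  */

/* Hypothetical policy: which physical slice core 'core' should allocate
 * from.  Contiguous slices here for simplicity; in practice you'd want a
 * core's banks interleaved.  Nothing like this exists in current OSes AFAIK. */
static uint64_t bank_base_for_core(unsigned core, unsigned slice)
{
    return (uint64_t)(core * BANKS_PER_CORE + slice) * BANK_BYTES;
}

int main(void)
{
    for (unsigned core = 0; core < 4; core++)
        for (unsigned slice = 0; slice < BANKS_PER_CORE; slice++)
            printf("core %u, slice %u: [%llu MB, %llu MB)\n", core, slice,
                   (unsigned long long)(bank_base_for_core(core, slice) >> 20),
                   (unsigned long long)((bank_base_for_core(core, slice)
                                         + BANK_BYTES) >> 20));
    return 0;
}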

This setup is electrically achievable in current systems if you use two dual-rank DIMMs per channel and no rank interleaving. In that case you have 8-way bank interleaving, with channel interleaving, and with the 4 ranks in contiguous memory blocks. With AMD's Barcelona you can get away with a single dual-rank DIMM per channel if you run the two channels independently (though in this case single-threaded performance is compromised, because each core will tend to access memory on only a single controller). An 8-thread system like Nehalem + Hyper-Threading would ideally like 64 banks. Because of Nehalem's wonky memory controller (seriously, who was the guy in charge who settled on three channels? I can imagine the joy of the memory controller engineers when they found out they'd have to implement a divide-by-three in a critical path) it'd be a little more difficult to get working there, though there are still enough banks to go around (12 banks per thread).
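
The bank counts above all fall out of a simple product. The DIMM populations below are my assumptions about typical configurations, not anything mandated by the platforms:

#include <stdio.h>

static int total_banks(int channels, int dimms_per_channel,
                       int ranks_per_dimm, int banks_per_rank)
{
    return channels * dimms_per_channel * ranks_per_dimm * banks_per_rank;
}

int main(void)
{
    /* Dual channel, two dual-rank DIMMs per channel, 8 banks per rank. */
    printf("dual channel:   %d banks\n", total_banks(2, 2, 2, 8));

    /* Nehalem-style triple channel, same population, 8 threads. */
    int nehalem = total_banks(3, 2, 2, 8);
    printf("triple channel: %d banks (%d per thread)\n", nehalem, nehalem / 8);
    return 0;
}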

However, I'm not aware of any OSes that support this quasi-NUMA. I'm guessing it could be hacked into Linux without too much trouble, given that real NUMA support is already there. It's something I've been meaning to look into for a while, but I've never had the time to really get my hands dirty figuring out Linux's NUMA architecture.


Cheers,
Michael
