Mark Hahn wrote:
> (Well, duh).
>
> yeah - the point seems to be that we (still) need to scale memory
> along with core count.  not just memory bandwidth but also concurrency
> (number of banks), though "ieee spectrum online for tech insiders"
> doesn't get into that kind of depth :(

I think this needs to be elaborated a little for those who don't know the layout of SDRAM ...

A typical chip used in a 4 GB DIMM would be a 2 Gbit SDRAM chip, of which there would be 16 (16 x 2 Gbit = 32 Gbit = 4 GB). Each chip contributes 8 bits towards the 64-bit DIMM interface, so there are two "ranks", each made up of 8 chips. Each rank operates independently of the other, but the two share (and are limited by) the bandwidth of the memory channel. From here on I'm going to use the Micron MT47H128M16 as the SDRAM chip, because I have the datasheet, though other chips are probably very similar.
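
Just to sanity-check those numbers, here's a little C sketch of the geometry. The constants simply restate the figures above (2 Gbit chips in an x8 organization, 16 of them on a 64-bit DIMM); nothing here is pulled from the datasheet beyond that.

#include <stdio.h>

int main(void)
{
    /* Figures restated from the text: 2 Gbit chips, x8 organization,
     * 16 chips on the DIMM, 64-bit data bus. */
    const int chip_width_bits = 8;
    const long long chip_mbit = 2048;
    const int dimm_width_bits = 64;
    const int chips_on_dimm   = 16;

    int chips_per_rank   = dimm_width_bits / chip_width_bits;  /* 8 */
    int ranks            = chips_on_dimm / chips_per_rank;     /* 2 */
    long long dimm_mbyte = chips_on_dimm * chip_mbit / 8;      /* 4096 MB */

    printf("%d chips per rank, %d ranks, %lld MB per DIMM\n",
           chips_per_rank, ranks, dimm_mbyte);
    return 0;
}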

Internally, each SDRAM chip is made up of 8 banks of 32 K * 8 Kbit memory arrays. Each bank can be controlled separately but shares the DIMM bandwidth, much as each rank does. Before a particular memory cell can be accessed, the whole 8 Kbit "row" containing it needs to be activated. Only one row can be active per bank at any point in time. Once the memory controller is done with a particular row, the row needs to be "precharged", which basically amounts to writing it back into the main array. Activating and precharging are relatively expensive operations: at top speed (DDR2-1066) on the Micron chips mentioned, precharging one row takes at least 11 cycles (tRTP + tRP) and activating another takes at least 7 cycles (tRCD), during which no data can be read from or written to the bank. Precharging takes another 4 cycles if you've just written to the bank.
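
To put rough numbers on it, here's a minimal sketch that just adds up the cycle counts above (treat them as illustrative rather than datasheet-exact), plus a sanity check that 8 banks of 32 K * 8 Kbit really is 2 Gbit.

#include <stdio.h>

int main(void)
{
    const int t_rtp_plus_rp = 11; /* read-to-precharge + precharge      */
    const int t_rcd         = 7;  /* activate-to-first-read/write       */
    const int t_write_extra = 4;  /* extra precharge cost after a write */

    /* Sanity check on the array size: 8 banks x 32 K rows x 8 Kbit rows. */
    long long chip_bits = 8LL * 32 * 1024 * 8 * 1024;
    printf("chip capacity: %lld Mbit\n", chip_bits / (1024 * 1024));

    printf("row change after a read:  %d dead cycles\n", t_rtp_plus_rp + t_rcd);
    printf("row change after a write: %d dead cycles\n",
           t_rtp_plus_rp + t_rcd + t_write_extra);
    return 0;
}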

The second thing to know is that processors operate in cacheline-sized blocks. Current x86 cache lines are 64 bytes, IIRC. In a dual-channel system with channel interleaving, odd-numbered cachelines come from one channel and even-numbered cachelines from the other. So each cacheline fill reads 8 bytes per chip (which fits in nicely with the standard burst length of 8, since each beat from the chip is 8 bits), which comes out to 128 cachelines per row. Like channel interleaving, bank interleaving is also used. So:
- Cacheline 0 comes from channel 0, bank 0
- Cacheline 1 comes from channel 1, bank 0
- Cacheline 2 comes from channel 0, bank 1
- Cacheline 3 comes from channel 1, bank 1
  ...
- Cacheline 14 comes from channel 0, bank 7
- Cacheline 15 comes from channel 1, bank 7
So this pattern repeats every 1 KB, and every 128 KB a new row needs to be opened on each bank. IIRC, rank interleaving is done on AMD quad-core processors, but not on the older dual-core processors or on Intel's discrete northbridges. I'm not sure about Nehalem.
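
If it helps, here's a toy decode of that mapping in C. The field boundaries are my own simplification (64 B lines, 2 channels, 8 banks, 128 lines per row per bank, no rank interleaving); real controllers are free to hash these bits differently.

#include <stdio.h>
#include <stdint.h>

struct dram_loc { unsigned channel, bank; uint64_t row; };

/* Toy decode: channel interleave first, then 8-way bank interleave,
 * 128 cachelines per row per bank. */
static struct dram_loc decode_interleaved(uint64_t addr)
{
    struct dram_loc loc;
    uint64_t line = addr / 64;                 /* 64 B cacheline index  */
    loc.channel = (unsigned)(line % 2);        /* alternate channels    */
    loc.bank    = (unsigned)((line / 2) % 8);  /* then alternate banks  */
    loc.row     = line / (2 * 8) / 128;        /* new row every 128 KB  */
    return loc;
}

int main(void)
{
    uint64_t lines[] = { 0, 1, 2, 3, 14, 15, 16, 2047, 2048 };
    for (unsigned i = 0; i < sizeof lines / sizeof lines[0]; i++) {
        struct dram_loc l = decode_interleaved(lines[i] * 64);
        printf("cacheline %4llu -> channel %u, bank %u, row %llu\n",
               (unsigned long long)lines[i], l.channel, l.bank,
               (unsigned long long)l.row);
    }
    return 0;
}

Cacheline 16 lands back on channel 0, bank 0 (the 1 KB repeat), and cacheline 2048 is the first one to open row 1 (the 128 KB boundary).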

This is all fine and dandy on a single-core system. Bank interleaving keeps the channel busy by using another bank while one bank is being activated or precharged, and with a good prefetcher you can hit close to 100% utilization of the channel. However, it can cause problems on a multi-core system. Say you have two cores, each scanning through a separate 1 MB block of memory. Each core is demanding a different row from the same bank, so the memory controller has to keep changing rows. This may not appear to be an issue at first glance - after all, we have 128 cycles between each CPU hitting a particular bank (8 bursts * 8 cycles per burst * 2 processors sharing bandwidth), so we've got 64 cycles between row changes. That's over twice what we need (unless we're using 1 GB or smaller DIMMs, which only have 4 banks, so things become tight).

The killer, though, is latency - instead of a 4-ish cycle CAS delay per read, we're now looking at 22 cycles for a precharge + activate + CAS. In a streaming situation this doesn't hurt too much, as a good prefetcher will already be indicating that it needs the next cacheline. But if your access patterns aren't particularly prefetcher-friendly, you're going to suffer.
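
In numbers (again using the illustrative cycle counts from above, with a 4-ish cycle CAS):

#include <stdio.h>

int main(void)
{
    const int t_cas       = 4;   /* CAS delay when the row is already open */
    const int t_precharge = 11;  /* tRTP + tRP                             */
    const int t_activate  = 7;   /* tRCD                                   */

    int open_row_hit = t_cas;                            /*  ~4 cycles */
    int row_conflict = t_precharge + t_activate + t_cas; /* ~22 cycles */

    printf("open-row hit: %d cycles, row conflict: %d cycles (%.1fx)\n",
           open_row_hit, row_conflict, (double)row_conflict / open_row_hit);
    return 0;
}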

Simply cranking up the number of banks doesn't help this. You've still got thrashing, you're just thrashing more banks. Turning up the cacheline size can help, as you transfer more data per stall. The extreme solution is to turn off bank interleaving. Our memory layout now looks like:
- Cacheline 0 comes from channel 0, bank 0, row 0, offset 0 bits
- Cacheline 1 comes from channel 1, bank 0, row 0, offset 0 bits
- Cacheline 2 comes from channel 0, bank 0, row 0, offset 64 bits
- Cacheline 3 comes from channel 1, bank 0, row 0, offset 64 bits
  ...
- Cacheline 254 comes from channel 0, bank 0, row 0, offset 8 K - 64 bits
- Cacheline 255 comes from channel 1, bank 0, row 0, offset 8 K - 64 bits
- Cacheline 256 comes from channel 0, bank 0, row 1, offset 0 bits
- Cacheline 257 comes from channel 1, bank 0, row 1, offset 0 bits
So a new row every 16 KB, and a new bank every 512 MB (and a new rank every 4 GB).
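
Here's the same toy decode with bank interleaving switched off (channel interleaving kept). Again the boundaries are just my simplification; the offset comes out in bytes across the rank, which is numerically the same as the per-chip bit offsets in the list above.

#include <stdio.h>
#include <stdint.h>

struct dram_loc { unsigned channel, bank, offset; uint64_t row; };

/* Toy decode with bank interleaving off: consecutive lines walk through a
 * whole row, then the next row, and stay in one bank for 512 MB. */
static struct dram_loc decode_no_bank_interleave(uint64_t addr)
{
    struct dram_loc loc;
    uint64_t line     = addr / 64;            /* cacheline index            */
    uint64_t per_chan = line / 2;             /* line index within channel  */
    loc.channel = (unsigned)(line % 2);
    loc.offset  = (unsigned)(per_chan % 128) * 64;        /* within the row */
    loc.row     = (per_chan / 128) % 32768;               /* 32 K rows/bank */
    loc.bank    = (unsigned)(per_chan / 128 / 32768) % 8;
    return loc;
}

int main(void)
{
    uint64_t samples[] = { 0, 64, 128, 16 * 1024, 512ULL << 20 };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        struct dram_loc l = decode_no_bank_interleave(samples[i]);
        printf("addr %10llu -> channel %u, bank %u, row %llu, offset %u\n",
               (unsigned long long)samples[i], l.channel, l.bank,
               (unsigned long long)l.row, l.offset);
    }
    return 0;
}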

For a single core this generally doesn't have a big effect, since the 18-cycle precharge+activate delay can often be hidden by a good prefetcher, and in any case it only comes around every 16 KB (as opposed to every 128 KB with bank interleaving - so row changes are a bit more frequent, though for large memory blocks it's a wash). However, this is a big killer for multicore: if you have two cores walking through the same 512 MB area, they'll be thrashing the same bank. Not only does latency suffer, but bandwidth does as well, since the other 7 banks can't be used to cover up the wasted time. Every 8 cycles of reading will require about 18 cycles of sitting around waiting for the bank, dropping bandwidth by about 70%.
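
The ~70% figure is just the ratio of dead time to total time:

#include <stdio.h>

int main(void)
{
    const double data_cycles    = 8.0;   /* data transfer per cacheline      */
    const double penalty_cycles = 18.0;  /* precharge + activate, not hidden */

    double utilization = data_cycles / (data_cycles + penalty_cycles);
    printf("channel utilization while thrashing: %.0f%% (~%.0f%% drop)\n",
           100.0 * utilization, 100.0 * (1.0 - utilization));
    return 0;
}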

However, with proper OS support this can be a bit of a win. By associating banks (512 MB memory blocks) with cores in the standard NUMA way, each core can be operating out of its own bank. There's no bank thrashing at all, which allows much looser requirements on activation and precharge, which in turn can allow higher speeds. With channel interleaving we can have up to 8 cores/threads operating this way; with independent channels (a la Barcelona) we can do 16.

Of course, this isn't ideal either. A row change will stall the associated CPU and can't be hidden, so ideally we want at least 2 banks per CPU, interleaved. Shared memory will also be hurt under this scheme (in both bandwidth and latency), since it will experience bank thrashing and will only have 2 banks to work with. To cover the activate and precharge times we need at least 4 banks per core, so for a quad-core CPU we need a total of 16 memory banks in the system, partly interleaved; 8 banks per core can improve performance further with certain access patterns. Also, to keep good single-core performance we'll need to use both channels. In that case, 4-way bank interleaving per channel (two sets of 4-way interleaves), with channel interleaving and no rank interleaving, would work - though again, 8-way bank interleaving would be better if there's enough to go around.
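
There's no OS interface for any of this today (more on that below), but the placement policy itself is trivial to state. A purely hypothetical sketch, with made-up names, using contiguous 512 MB bank regions and two slices per core:

#include <stdio.h>
#include <stdint.h>

#define BANK_BYTES     (512ULL << 20) /* one interleaved bank region         */
#define BANKS_PER_CORE 2              /* >= 2 so row changes can be covered  */

/* Hypothetical policy: which physical slice core 'core' should allocate
 * from.  Contiguous slices here for simplicity; in practice you'd want a
 * core's banks interleaved.  Nothing like this exists in current OSes AFAIK. */
static uint64_t bank_base_for_core(unsigned core, unsigned slice)
{
    return (uint64_t)(core * BANKS_PER_CORE + slice) * BANK_BYTES;
}

int main(void)
{
    for (unsigned core = 0; core < 4; core++)
        for (unsigned slice = 0; slice < BANKS_PER_CORE; slice++)
            printf("core %u, slice %u: [%llu MB, %llu MB)\n", core, slice,
                   (unsigned long long)(bank_base_for_core(core, slice) >> 20),
                   (unsigned long long)((bank_base_for_core(core, slice)
                                         + BANK_BYTES) >> 20));
    return 0;
}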

This setup is electrically achievable in current systems if you use two dual-rank DIMMs per channel and no rank interleaving. In that case you have 8-way bank interleaving, with channel interleaving, and with the 4 ranks in contiguous memory blocks. With AMD's Barcelona you can get away with a single dual-rank DIMM per channel if you run the two channels independently (though in this case single-threaded performance is compromised, because each core will tend to access memory on only a single controller). An 8-thread system like Nehalem + Hyper-Threading would ideally like 64 banks. Because of Nehalem's wonky memory controller (seriously, who was the guy in charge who settled on three channels? I can imagine the joy of the memory controller engineers when they found out they'd have to implement a divide-by-three in a critical path) it'd be a little more difficult to get working there, though there are still enough banks to go around (12 banks per thread).
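
The bank counts above all fall out of a simple product. The DIMM populations below are my assumptions about typical configurations, not anything mandated by the platforms:

#include <stdio.h>

static int total_banks(int channels, int dimms_per_channel,
                       int ranks_per_dimm, int banks_per_rank)
{
    return channels * dimms_per_channel * ranks_per_dimm * banks_per_rank;
}

int main(void)
{
    /* Dual channel, two dual-rank DIMMs per channel, 8 banks per rank. */
    printf("dual channel:   %d banks\n", total_banks(2, 2, 2, 8));

    /* Nehalem-style triple channel, same population, 8 threads. */
    int nehalem = total_banks(3, 2, 2, 8);
    printf("triple channel: %d banks (%d per thread)\n", nehalem, nehalem / 8);
    return 0;
}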

However, I'm not aware of any OSes that support this quasi-NUMA. I'm guessing it could be hacked into Linux without too much trouble, given that real NUMA support is already there. It's something I've been meaning to look into for a while, but I've never had the time to really get my hands dirty figuring out Linux's NUMA architecture.


Cheers,
Michael
