Hello,

Maybe many of you have already observed this phenomenon. If you have a
1920x1080 monitor with a 60Hz refresh rate and 32bpp color depth,
then it is sometimes possible to see glitches, which look like
waves rolling all over the screen. This happens when the
CPU or the Mali GPU is doing lots of memory accesses.

Even if you have not seen it yet, an easy way to reproduce the
problem would be to run https://github.com/ssvb/lima-memtester/
The cube is expected to just stay at the same spot and rotate, but
on an A10 with a 1920x1080 60Hz monitor it also jumps up and down
wildly.

So what's going on?

Long story short: it looks like the DEBE DMA in the Allwinner A10 is
just doing isolated 32 byte burst reads at regular intervals to
scan out the framebuffer and send this data over HDMI. How does this
cause all the trouble? I'll try to explain.

The DRAM controller has a number of host ports for different
peripherals and some knobs to configure them. For example, it is
possible to configure different priorities for these ports:
    
http://rhombus-tech.net/allwinner_a10/A10_register_guide/A10_DRAMC/#index38h2
And the corresponding code in the u-boot sources is here:
    
https://github.com/linux-sunxi/u-boot-sunxi/blob/06ad4bb62b69e886/arch/arm/cpu/armv7/sunxi/dram.c#L160
The "0x0735" is the configuration for DEBE. It means that it is enabled
(lowest bit is set) and configured for high priority (higher than CPU
and GPU). Playing around with it is easy, and we can see that these
settings really have some effect. For example, we can disable the DEBE
host port and see that there is no picture on the screen :-) Or we can
invert the priorities to make the CPU higher priority than DEBE, and see
that there is no significant memset performance drop anymore. The
abnormal memset performance results can be found in Table 1 of
my old blog post:
    
http://ssvb.github.io/2013/06/27/fullhd-x11-desktop-performance-of-the-allwinner-a10.html

So let's assume that the DEBE DMA is indeed doing isolated 32 byte
burst reads at regular intervals. And we have the CPU also doing
a long sequential read or write at the same time. The DDR3 memory
is organized as 8 banks, and the banks are interleaved. With the 4096
byte page size on the Cubieboard (two 2048 byte pages from two 16-bit
DRAM chips combined), any physical addresses that differ by a multiple
of 32K bytes belong to the same bank. When both CPU and DEBE are trying
to access data in the same bank, they are fighting with each other to
keep their own page active in this particular bank. Switching the
active page in a bank is very expensive. This requires executing
PRECHARGE and ACTIVATE commands. In the worst case, each 32 byte burst
read originating from DEBE is going to cause a bank conflict and the
PRECHARGE/ACTIVATE overhead. This is a huge waste. The DEBE is creating
such bubbles at regular intervals, which fully occupy the DRAM
controller. Now an interesting question is, how much can be done
between these bubbles? It appears that as the screen resolution
increases, the gaps between these bubbles eventually become so small
that only one set of PRECHARGE, ACTIVATE, READ/WRITE commands can be
done by the CPU in between. It means that whenever we hit a bank
conflict, the CPU and DEBE become entangled and can't escape each
other. So it goes on like this:
1. DEBE tries to read 32 bytes, page miss, PRECHARGE/ACTIVATE overhead
2. CPU tries to write 32 bytes, page miss, PRECHARGE/ACTIVATE overhead
3. DEBE tries to read 32 bytes, page miss, PRECHARGE/ACTIVATE overhead
...
Because at each step the DEBE and CPU only advance 32 bytes forward, the
distance between them remains the same and page misses never end. The
speed of DEBE data fetch and the speed of memset are tied together
(at around 500MB/s). The only escape would be for the CPU to transfer
at least 64 bytes per access. And I believe that the CPU is able to do
this at least for reads (because reads involve 64 byte cache line
fills). On the other hand, writes bypass the cache when write-allocate
is not in use.
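
To make the bank interleaving described above more concrete, here is a
minimal sketch of the address to bank mapping. It assumes the simple
model from this mail (4096 byte pages, 8 banks interleaved page by
page). The helper names bank_of and same_bank? and the example
addresses are hypothetical and only serve as an illustration:

```ruby
PAGE_SIZE = 4096   # bytes per combined DRAM page (two 2048 byte pages)
NUM_BANKS = 8      # DDR3 banks, assumed to be interleaved page by page

# Bank index of a physical address in this simplified model
def bank_of(phys_addr)
  (phys_addr / PAGE_SIZE) % NUM_BANKS
end

# True if both addresses fall into the same bank (a potential conflict
# when they are in different pages of that bank)
def same_bank?(addr_a, addr_b)
  bank_of(addr_a) == bank_of(addr_b)
end

fb_addr  = 0x4000_0000                  # hypothetical framebuffer address
cpu_addr = fb_addr + 5 * 32 * 1024 + 64 # hypothetical memset destination

printf("fb bank = %d, cpu bank = %d, same bank: %s\n",
       bank_of(fb_addr), bank_of(cpu_addr), same_bank?(fb_addr, cpu_addr))
```

Any two addresses whose distance is a multiple of 32K (PAGE_SIZE *
NUM_BANKS) end up in the same bank but in different pages, which is
exactly the situation where PRECHARGE/ACTIVATE is needed on every switch.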

I have also implemented a simplified software simulation (let's put
it under GPLv2 license for clarity) of this whole process:

#!/usr/bin/env ruby

def simulate_cpu_vs_debe_competition(args)
    dram_freq = args[:dram_freq]
    number_of_bursts_in_cpu_access = args[:number_of_bursts_in_cpu_access]
    backwards = args[:walk_memory_in_backwards_direction]

    xres = args[:xres]
    yres = args[:yres]
    hz   = args[:hz]

    # Cubieboard1 dram timings
    dramc = {
        :dram_freq      => dram_freq,
        :CL             => 6,
        :tRP            => 6,
        :tRCD           => 6,
        :bank_user      => ["CPU"] * 8,
        :page_size      => 4096,
        :burst_bytes    => 32,
        :cycle_counter  => 0,
        :byte_counters  => {"CPU" => 0, "DEBE" => 0}
    }

    def dramc_simulate_burst_transfer(dramc, user, backwards, number_of_bursts)
        bank_num = (dramc[:byte_counters][user] / dramc[:page_size]) % 8
        bank_num = 7 - bank_num if backwards
        if dramc[:bank_user][bank_num] != user then
            # Bank conflict
            dramc[:bank_user][bank_num] = user
            dramc[:cycle_counter] += dramc[:tRP] + dramc[:tRCD]
            dramc[:cycle_counter] += dramc[:CL] + 4 * number_of_bursts
        else
            # Nice sequential access in the same bank
            dramc[:cycle_counter] += 4 * number_of_bursts
        end
        dramc[:byte_counters][user] += number_of_bursts * dramc[:burst_bytes]
    end

    cycles_between_fb_reads = dram_freq * 1000000 /
                              (xres * yres * 4 * hz / dramc[:burst_bytes])

    fb_reads = 0
    1.upto(100000) {
        while dramc[:cycle_counter] / cycles_between_fb_reads >= fb_reads do
            # DEBE requests just one burst in forward direction
            dramc_simulate_burst_transfer(dramc, "DEBE", false, 1)
            fb_reads += 1
        end
        dramc_simulate_burst_transfer(dramc, "CPU", backwards,
                                      args[:number_of_bursts_in_cpu_access])
    }

    return {
        :theoretical_bandwidth => dramc[:dram_freq] * dramc[:burst_bytes] / 4,
        :cpu_bandwidth         => dramc[:byte_counters]["CPU"] *
                                  dramc[:dram_freq] / dramc[:cycle_counter],
        :debe_bandwidth        => dramc[:byte_counters]["DEBE"] *
                                  dramc[:dram_freq] / dramc[:cycle_counter],
        :cycles_between_fb_reads => cycles_between_fb_reads,
    }
end

# Run the simulation for several refresh rates and DRAM clock frequencies

xres = (ARGV[0] or "1920").to_i
yres = (ARGV[1] or "1080").to_i

[60, 56, 50].each {|hz|
(360 .. 480).step(24) {|dram_freq|
    # Simulate memory read (two bursts for a 64 byte cache line fill)
    result = simulate_cpu_vs_debe_competition({
           :xres => xres, :yres => yres, :hz => hz,
           :dram_freq => dram_freq,
           :number_of_bursts_in_cpu_access => 2,
           :walk_memory_in_backwards_direction => true})
    printf("%dx%d-32@%dHz, dram_freq=%dMHz : forward read    = %4d MB/s\n",
           xres, yres, hz, dram_freq,
           result[:cpu_bandwidth])

    # Simulate ordinary memset (writes are bypassing caches without
    # write-allocate, so do just one 32 byte burst)
    result = simulate_cpu_vs_debe_competition({
           :xres => xres, :yres => yres, :hz => hz,
           :dram_freq => dram_freq,
           :number_of_bursts_in_cpu_access => 1,
           :walk_memory_in_backwards_direction => false})
    printf("%dx%d-32@%dHz, dram_freq=%dMHz : forward write   = %4d MB/s%s\n",
           xres, yres, hz, dram_freq,
           result[:cpu_bandwidth],
           result[:cpu_bandwidth] < 550 ? " (*)" : "")

    # Simulate backwards memset (writes are bypassing caches without
    # write-allocate, so do just one 32 byte burst)
    result = simulate_cpu_vs_debe_competition({
           :xres => xres, :yres => yres, :hz => hz,
           :dram_freq => dram_freq,
           :number_of_bursts_in_cpu_access => 1,
           :walk_memory_in_backwards_direction => true})
    printf("%dx%d-32@%dHz, dram_freq=%dMHz : backwards write = %4d MB/s\n",
           xres, yres, hz, dram_freq,
           result[:cpu_bandwidth])

    printf("-\n")
}}
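
As a quick sanity check of the numbers used by the script above, the
following sketch just redoes the arithmetic by hand. It reuses the
timings and the 32 byte burst size from the script; the method names
are hypothetical, and the cost formula is the one from the simplified
model, not a measurement of real hardware:

```ruby
# Timings in DRAM clock cycles, copied from the simulation script
T_RP         = 6
T_RCD        = 6
CL           = 6
BURST_CYCLES = 4
BURST_BYTES  = 32

# Framebuffer scanout bandwidth in bytes per second at 32bpp
def scanout_bytes_per_second(xres, yres, hz)
  xres * yres * 4 * hz
end

# DRAM clock cycles between two DEBE burst reads
# (integer division, the same as in the script)
def cycles_between_fb_reads(xres, yres, hz, dram_freq_mhz)
  dram_freq_mhz * 1_000_000 /
    (scanout_bytes_per_second(xres, yres, hz) / BURST_BYTES)
end

# Cost of a single 32 byte burst that hits a bank conflict in the model
def page_miss_burst_cycles
  T_RP + T_RCD + CL + BURST_CYCLES
end

printf("scanout: %.1f MB/s\n", scanout_bytes_per_second(1920, 1080, 60) / 1e6)
[360, 480].each do |mhz|
  printf("dram_freq=%dMHz: %d cycles between DEBE bursts, page miss = %d cycles\n",
         mhz, cycles_between_fb_reads(1920, 1080, 60, mhz),
         page_miss_burst_cycles)
end
```

For 1920x1080-32@60Hz this gives about 498 MB/s of scanout traffic. In
this model a single burst with a page miss costs 22 cycles, while only
23 cycles (at 360MHz) or 30 cycles (at 480MHz) pass between two DEBE
bursts, so there is very little room left for anything else.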
