Hello,

Many of you have probably already observed this phenomenon. If you have a 1920x1080 monitor with a 60Hz refresh rate and 32bpp color depth, you can sometimes see glitches that look like waves rolling all over the screen. This happens when the CPU or the Mali GPU is doing lots of memory accesses.
Even if you have not seen it yet, an easy way to reproduce the problem is to run https://github.com/ssvb/lima-memtester/

The cube is expected to stay in the same spot and rotate, but on an A10 with a 1920x1080 60Hz monitor it also jumps up and down wildly.

So what's going on? Long story short: it looks like the DEBE DMA in the Allwinner A10 is just doing isolated 32 byte burst reads at regular intervals to scan out the framebuffer and send this data over HDMI.

How does this cause all the trouble? I'll try to explain. The DRAM controller has a number of host ports for different peripherals and some knobs to configure them. For example, it is possible to configure different priorities for these ports:
    http://rhombus-tech.net/allwinner_a10/A10_register_guide/A10_DRAMC/#index38h2
The corresponding code in the u-boot sources is here:
    https://github.com/linux-sunxi/u-boot-sunxi/blob/06ad4bb62b69e886/arch/arm/cpu/armv7/sunxi/dram.c#L160

The "0x0735" is the configuration for DEBE. It means that the port is enabled (the lowest bit is set) and configured for high priority (higher than the CPU and GPU). Playing around with it is easy, and we can see that these settings really have some effect. For example, we can disable the DEBE host port and see that there is no picture on the screen :-) Or we can invert the priorities to make the CPU higher priority than DEBE, and see that there is no significant memset performance drop anymore. The abnormal memset performance results can be found in Table 1 of my old blog post:
    http://ssvb.github.io/2013/06/27/fullhd-x11-desktop-performance-of-the-allwinner-a10.html

So let's assume that the DEBE DMA is indeed doing isolated 32 byte burst reads at regular intervals, and that the CPU is doing a long sequential read or write at the same time. The DDR3 memory is organized as 8 banks, and the banks are interleaved.
For the 4096 byte page size in the Cubieboard (two 2048 byte pages from two 16-bit DRAM chips combined), any physical addresses that differ by a multiple of 32K bytes belong to the same bank. When both the CPU and DEBE are trying to access data in the same bank, they fight with each other to keep their own page active in that bank. Switching the active page in a bank is very expensive: it requires executing the PRECHARGE and ACTIVATE commands. In the worst case, each 32 byte burst read originating from DEBE causes a bank conflict and the PRECHARGE/ACTIVATE overhead. This is a huge waste. DEBE creates such bubbles at regular intervals, and each of them fully occupies the DRAM controller.

Now an interesting question is: how much can be done between these bubbles? It appears that as the screen resolution increases, the gaps between the bubbles eventually become so small that the CPU can only fit one set of PRECHARGE, ACTIVATE, READ/WRITE commands in between. This means that whenever we hit a bank conflict, the CPU and DEBE become entangled and can't escape each other. So it goes on like this:

1. DEBE tries to read 32 bytes, page miss, PRECHARGE/ACTIVATE overhead
2. CPU tries to write 32 bytes, page miss, PRECHARGE/ACTIVATE overhead
3. DEBE tries to read 32 bytes, page miss, PRECHARGE/ACTIVATE overhead
...

Because at each step DEBE and the CPU only advance 32 bytes forward, the distance between them remains the same and the page misses never end. The speed of the DEBE data fetch and the speed of memset are tied together (at around 500MB/s). The only escape would be for the CPU to transfer at least 64 bytes at a time. I believe that the CPU is able to do this at least for reads (because reads involve 64 byte cache line fills). Writes, on the other hand, bypass the cache when write-allocate is not in use.
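To make the numbers above easier to check, here is a minimal Ruby sketch of the bank mapping and of the scanout data rate. It assumes the simple interleaving described in the text (4096 byte pages, 8 banks, consecutive pages going to consecutive banks). The helper names bank_of and scanout_bytes_per_sec and the example address 0x40000000 are made up for illustration and are not taken from any real code.

```ruby
# Assumptions from the text: 4096 byte pages, 8 interleaved banks.
PAGE_SIZE = 4096
NUM_BANKS = 8

# Hypothetical helper: bank number that a physical address falls into,
# assuming consecutive pages are spread over consecutive banks.
def bank_of(addr)
  (addr / PAGE_SIZE) % NUM_BANKS
end

# Hypothetical helper: bytes per second that have to be scanned out
# for a 32bpp (4 bytes per pixel) framebuffer.
def scanout_bytes_per_sec(xres, yres, hz)
  xres * yres * 4 * hz
end

base = 0x40000000  # arbitrary example address

# Addresses 32K apart land in the same bank, addresses 4K apart do not.
puts bank_of(base) == bank_of(base + 32 * 1024)
puts bank_of(base) == bank_of(base + PAGE_SIZE)

# Scanout rate for 1920x1080-32@60Hz in MB/s.
puts scanout_bytes_per_sec(1920, 1080, 60) / 1_000_000.0
```

The scanout rate for 1920x1080-32@60Hz works out to 497.664 MB/s, which is where the "around 500MB/s" figure comes from: if the CPU advances by one 32 byte burst for every DEBE burst, memset ends up running at the same rate as the framebuffer fetch.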
I have also implemented a simplified software simulation of this whole process (let's put it under the GPLv2 license for clarity):

#!/usr/bin/env ruby

# Account for one access of 'number_of_bursts' bursts by 'user' ("CPU" or
# "DEBE") and advance the DRAM cycle counter accordingly
def dramc_simulate_burst_transfer(dramc, user, backwards, number_of_bursts)
  bank_num = (dramc[:byte_counters][user] / dramc[:page_size]) % 8
  bank_num = 7 - bank_num if backwards
  if dramc[:bank_user][bank_num] != user then
    # Bank conflict
    dramc[:bank_user][bank_num] = user
    dramc[:cycle_counter] += dramc[:tRP] + dramc[:tRCD]
    dramc[:cycle_counter] += dramc[:CL] + 4 * number_of_bursts
  else
    # Nice sequential access in the same bank
    dramc[:cycle_counter] += 4 * number_of_bursts
  end
  dramc[:byte_counters][user] += number_of_bursts * dramc[:burst_bytes]
end

def simulate_cpu_vs_debe_competition(args)
  dram_freq = args[:dram_freq]
  number_of_bursts_in_cpu_access = args[:number_of_bursts_in_cpu_access]
  backwards = args[:walk_memory_in_backwards_direction]
  xres = args[:xres]
  yres = args[:yres]
  hz = args[:hz]

  # Cubieboard1 dram timings
  dramc = {
    :dram_freq => dram_freq,
    :CL => 6,
    :tRP => 6,
    :tRCD => 6,
    :bank_user => ["CPU"] * 8,
    :page_size => 4096,
    :burst_bytes => 32,
    :cycle_counter => 0,
    :byte_counters => {"CPU" => 0, "DEBE" => 0}
  }

  # Number of DRAM cycles between two consecutive DEBE burst reads
  cycles_between_fb_reads = dram_freq * 1000000 /
                            (xres * yres * 4 * hz / dramc[:burst_bytes])
  fb_reads = 0

  1.upto(100000) {
    while dramc[:cycle_counter] / cycles_between_fb_reads >= fb_reads do
      # DEBE requests just one burst in forward direction
      dramc_simulate_burst_transfer(dramc, "DEBE", false, 1)
      fb_reads += 1
    end
    dramc_simulate_burst_transfer(dramc, "CPU", backwards,
                                  number_of_bursts_in_cpu_access)
  }

  # Bandwidth numbers are in MB/s (bytes per cycle multiplied by MHz)
  return {
    :theoretical_bandwidth => dramc[:dram_freq] * dramc[:burst_bytes] / 4,
    :cpu_bandwidth => dramc[:byte_counters]["CPU"] * dramc[:dram_freq] /
                      dramc[:cycle_counter],
    :debe_bandwidth => dramc[:byte_counters]["DEBE"] * dramc[:dram_freq] /
                       dramc[:cycle_counter],
    :cycles_between_fb_reads => cycles_between_fb_reads,
  }
end

# Screen resolution from the command line (defaults to 1920x1080)
xres = (ARGV[0] or "1920").to_i
yres = (ARGV[1] or "1080").to_i

[60, 56, 50].each {|hz|
  (360 .. 480).step(24) {|dram_freq|
    # Simulate memory read (two bursts for a 64 byte cache line fill)
    result = simulate_cpu_vs_debe_competition({
      :xres => xres, :yres => yres, :hz => hz,
      :dram_freq => dram_freq,
      :number_of_bursts_in_cpu_access => 2,
      :walk_memory_in_backwards_direction => false})
    printf("%dx%d-32@%dHz, dram_freq=%dMHz : forward read = %4d MB/s\n",
           xres, yres, hz, dram_freq, result[:cpu_bandwidth])

    # Simulate ordinary memset (writes are bypassing caches without
    # write-allocate, so do just one 32 byte burst)
    result = simulate_cpu_vs_debe_competition({
      :xres => xres, :yres => yres, :hz => hz,
      :dram_freq => dram_freq,
      :number_of_bursts_in_cpu_access => 1,
      :walk_memory_in_backwards_direction => false})
    printf("%dx%d-32@%dHz, dram_freq=%dMHz : forward write = %4d MB/s%s\n",
           xres, yres, hz, dram_freq, result[:cpu_bandwidth],
           result[:cpu_bandwidth] < 550 ? " (*)" : "")

    # Simulate backwards memset (writes are bypassing caches without
    # write-allocate, so do just one 32 byte burst)
    result = simulate_cpu_vs_debe_competition({
      :xres => xres, :yres => yres, :hz => hz,
      :dram_freq => dram_freq,
      :number_of_bursts_in_cpu_access => 1,
      :walk_memory_in_backwards_direction => true})
    printf("%dx%d-32@%dHz, dram_freq=%dMHz : backwards write = %4d MB/s\n",
           xres, yres, hz, dram_freq, result[:cpu_bandwidth])
    printf("-\n")
  }}
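As a rough cross-check of the "gaps between bubbles" argument, the following sketch compares the cost of a page miss with the number of DRAM cycles available between two DEBE burst reads. It reuses the timings from the simulation above (CL = tRP = tRCD = 6 cycles, 4 cycles per 32 byte burst) and shares the same simplifications, so treat the results as an estimate rather than a measurement. The helper names are hypothetical.

```ruby
# Timings in DRAM clock cycles, same values as in the simulation
T_RP = 6
T_RCD = 6
CL = 6
BURST_CYCLES = 4
BURST_BYTES = 32

# Cost of one 32 byte burst that hits a bank conflict (simplified model)
def page_miss_cycles
  T_RP + T_RCD + CL + BURST_CYCLES
end

# DRAM cycles available between two consecutive DEBE burst reads
def cycles_between_fb_reads(dram_mhz, xres, yres, hz)
  bursts_per_sec = xres * yres * 4 * hz / BURST_BYTES
  dram_mhz * 1_000_000.0 / bursts_per_sec
end

# How many complete CPU page misses fit into the gap once DEBE's own
# page miss has been paid for (never less than zero)
def cpu_page_misses_between_fb_reads(dram_mhz, xres, yres, hz)
  spare = cycles_between_fb_reads(dram_mhz, xres, yres, hz) - page_miss_cycles
  [(spare / page_miss_cycles).floor, 0].max
end

[[1280, 720], [1920, 1080]].each do |xres, yres|
  printf("%dx%d-32@60Hz, dram_freq=480MHz : %.1f cycles per DEBE burst, " \
         "room for %d complete CPU page misses\n",
         xres, yres, cycles_between_fb_reads(480, xres, yres, 60),
         cpu_page_misses_between_fb_reads(480, xres, yres, 60))
end
```

Under these assumptions a page miss costs 22 cycles. At 1920x1080-32@60Hz and 480MHz there are only about 31 cycles per DEBE burst, so after DEBE's own page miss there is not enough room left for even one complete CPU page miss. At 1280x720 the gap is large enough for the CPU to get ahead.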
