On Tue, 22 Jan 2013 14:26:50 -0000 "Ben Avison" <[email protected]> wrote:
> On Tue, 22 Jan 2013 13:10:54 -0000, Siarhei Siamashka
> <[email protected]> wrote:
> > Just one thing looks a bit odd.
> >
> >> src_8888_8888
> >>
> >>      Before          After
> >>      Mean   StdDev   Mean   StdDev  Confidence  Change
> >> M    57.0   0.2      89.2   0.5     100.0%      +56.4%
> >
> > 89.2 MPix/s * 32bpp = ~357 MB/s
> >
> >> src_0565_0565
> >>
> >>      Before          After
> >>      Mean   StdDev   Mean   StdDev  Confidence  Change
> >> M    90.7   0.4      133.5  0.7     100.0%      +47.1%
> >
> > 133.5 MPix/s * 16bpp = ~267 MB/s
> >
> > Seems to be a much less efficient use of memory bandwidth here
> > compared to src_8888_8888?
>
> I think what you're seeing here is the speed difference between the
> word-aligned and misaligned code paths, because the M test cycles over
> many starting X positions for the source buffer, but always uses X=1 for
> the destination buffer. For 32bpp, all pixel positions are word-aligned,
> but for 16bpp this will result in half the runs being misaligned.
>
> I tried tweaking the M test to force alignment or misalignment, to get
> some comparative timings for src_0565_0565:
>
> Aligned:   169.8 Mpix/s (rather closer to 2* the 32bpp result)
> Unaligned: 108.6 Mpix/s
>
> I'm open to suggestions as to how to improve the misaligned case. Early
> in development, I compared the speed of doing LDM followed by
> in-register shuffling with either ORR or PKH instructions against using
> lots of unaligned LDRs, and the LDRs came out fastest by a small margin,
> which is why that's what's used in my patch.
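
For reference, the arithmetic and the alignment argument quoted above can be
written down as a small C sketch. The helper names are made up for
illustration and are not part of pixman or lowlevel-blt-bench, and the
sketch assumes that both scanlines start on a 32-bit boundary:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert a copy rate in Mpix/s into MB/s written
 * (or read) for a given pixel depth. */
double
mpix_to_mbytes (double mpix_per_s, unsigned bpp)
{
    assert (bpp % 8 == 0);
    return mpix_per_s * (bpp / 8.0);
}

/* Hypothetical helper: nonzero if the source and destination pixels sit at
 * the same byte offset within a 32-bit word, so that a word-by-word copy
 * needs no shifting. Assumes both scanlines start word-aligned. */
int
same_word_phase (unsigned src_x, unsigned dst_x, unsigned bpp)
{
    unsigned bytes_per_pixel = bpp / 8;

    return ((src_x * bytes_per_pixel) & 3u) == ((dst_x * bytes_per_pixel) & 3u);
}
```

With the destination fixed at X=1, same_word_phase() is always true at 32bpp,
but at 16bpp it is true only for odd source X positions, which is the "half
the runs" in the quoted explanation.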

I get the following results when comparing against an LDM-based memcpy
which performs reshuffling with ORR instructions:

memcpy (PIXMAN_DISABLE=arm-simd):
  src_0565_0565 = L1: 425.14  L2: 216.23  M:169.38  HT: 34.77  VT: 27.29  R: 26.47  RT:  9.45
  src_8888_8888 = L1: 490.12  L2: 139.03  M: 91.23  HT: 26.22  VT: 20.27  R: 20.41  RT:  8.44

pixman-armv6:
  src_0565_0565 = L1: 404.15  L2: 157.19  M:136.94  HT: 54.44  VT: 48.20  R: 42.83  RT: 14.30
  src_8888_8888 = L1: 352.31  L2: 144.24  M: 94.56  HT: 40.35  VT: 36.26  R: 34.27  RT: 13.34

The L1/L2 numbers are bogus, because ARM11 (and Cortex-A8 in its default
configuration) does not allocate data into the cache on write misses.
There is some code here to take read-allocate caches into account:
    http://cgit.freedesktop.org/pixman/tree/test/lowlevel-blt-bench.c?id=pixman-0.28.2#n178
But it uses 64-byte steps (which are bigger than the 32-byte cache line
of ARM11), and it also does such cache "prefetch" only for the first
scanline.

The numbers for HT/VT/R/RT are much better for the armv6 code because
these tests do blits with very small rectangles, and the per-scanline
call overhead shows up for memcpy.

Still, the numbers for the M test are interesting and demonstrate that
using LDM with ORR arithmetic is better for unaligned reads. Actually,
an LDM instruction only takes 1 cycle to issue on ARM11 regardless of
the size of the register list, and can continue to run in the background
simultaneously with independent ALU instructions.

Another reason to avoid unaligned reads is uncached framebuffers. Some
people are trying to use them directly (not that it's a good idea):
    http://comments.gmane.org/gmane.comp.lib.cairo/23331
And unaligned uncached reads are going to be particularly slow.

Please note that I'm not demanding an immediate fix.
But it would be nice to add this to a TODO list for future ARMv6
improvements :-)

--
Best regards,
Siarhei Siamashka
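
P.S. A rough C sketch of the "aligned loads plus ORR reshuffling" idea, for
the 16bpp case where the source is 2 bytes past a word boundary and the
destination is word-aligned. This is only an illustration and not the memcpy
code that was benchmarked above. The function name is hypothetical, a
little-endian system is assumed, and the loop reads one source word more
than it writes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, little-endian only. The first wanted pixel is the
 * upper halfword of src_words[0]. Writes n_words destination words and
 * reads src_words[0] .. src_words[n_words] using aligned word loads only. */
void
copy_halfword_shifted (uint32_t *dst, const uint32_t *src_words, size_t n_words)
{
    uint32_t prev;
    size_t   i;

    assert (dst != NULL && src_words != NULL);

    prev = src_words[0];                  /* aligned load, low half unused */
    for (i = 0; i < n_words; i++)
    {
        uint32_t next = src_words[i + 1]; /* aligned load (LDM in assembly) */

        /* The upper half of the previous word becomes the low pixel and the
         * lower half of the next word becomes the high pixel. On ARM this
         * is a shift plus one ORR with a shifted operand. */
        dst[i] = (prev >> 16) | (next << 16);
        prev = next;
    }
}
```

In real assembly the loads would be batched into a multi-register LDM and the
shift/ORR pairs interleaved with them, which is where the single-cycle LDM
issue mentioned above would help. The C loop only shows the data movement.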
