http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29294
--- Comment #10 from Siarhei Siamashka <siarhei.siamashka at gmail dot com> 2012-12-20 05:47:30 UTC ---
(In reply to comment #9)

And some performance measurements (for working with L1 cache):

> $ arm-none-eabi-gcc-4.7.2 -O2 -mcpu=cortex-a8 -c test.c
> $ objdump -d test.o
>
> 00000000 <fill>:
>    0:   e2511010    subs    r1, r1, #16
>    4:   412fff1e    bxmi    lr
>    8:   e2511010    subs    r1, r1, #16
>    c:   e1c020f0    strd    r2, [r0]
>   10:   e1c020f8    strd    r2, [r0, #8]
>   14:   e2800010    add     r0, r0, #16
>   18:   5afffffa    bpl     8 <fill+0x8>
>   1c:   e12fff1e    bx      lr

Cortex-A8  - 5 cycles per iteration
Cortex-A9  - 4.5 cycles per iteration
Cortex-A15 - 3 cycles per iteration

> $ arm-none-eabi-gcc-4.8.0 -O2 -mcpu=cortex-a8 -c test.c
> $ objdump -d test.o
>
> 00000000 <fill>:
>    0:   e351000f    cmp     r1, #15
>    4:   d12fff1e    bxle    lr
>    8:   e2411010    sub     r1, r1, #16
>    c:   e280c010    add     ip, r0, #16
>   10:   e3c1100f    bic     r1, r1, #15
>   14:   e08c1001    add     r1, ip, r1
>   18:   e1c020f0    strd    r2, [r0]
>   1c:   e2800010    add     r0, r0, #16
>   20:   e14020f8    strd    r2, [r0, #-8]
>   24:   e1500001    cmp     r0, r1
>   28:   1afffffa    bne     18 <fill+0x18>
>   2c:   e12fff1e    bx      lr

Cortex-A8  - 6 cycles per iteration
Cortex-A9  - 4 cycles per iteration
Cortex-A15 - 3 cycles per iteration

While we could have expected something like the following code for the
inner loop:

1:  strd    V, [BUF], #8
    subs    N, N, #16
    strd    V, [BUF], #8
    bpl     1b

Cortex-A8  - 4 cycles per iteration
Cortex-A9  - 4 cycles per iteration
Cortex-A15 - 2.5 cycles per iteration
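
For readers without comment #9 at hand: test.c itself is not shown in this comment, so the following is only a guess at its shape, reconstructed from the disassembly above (r0 = buffer pointer, r1 = byte count, r2:r3 = 64-bit fill value). The parameter names and the use of uint64_t are assumptions, not the original source. It is written with post-increment stores, which is the form the hand-written inner loop above maps onto (two post-indexed strd plus one subs/bpl pair).

```c
#include <stdint.h>

/*
 * Hypothetical reconstruction of the test function (an assumption, not
 * the original test.c). Stores the 64-bit value v twice per iteration,
 * i.e. 16 bytes, while at least 16 bytes remain in n.
 */
void fill(uint64_t *buf, int n, uint64_t v)
{
    /* n counts bytes; a negative result ends the loop (subs + bpl/bxmi). */
    while ((n -= 16) >= 0) {
        *buf++ = v;   /* candidate for: strd V, [BUF], #8 */
        *buf++ = v;   /* candidate for: strd V, [BUF], #8 */
    }
}
```

With this shape, any tail of fewer than 16 bytes is left untouched, which agrees with the early "bxmi lr" / "bxle lr" exits in both compiler outputs.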