http://gcc.gnu.org/bugzilla/show_bug.cgi?id=29294



--- Comment #10 from Siarhei Siamashka <siarhei.siamashka at gmail dot com> 
2012-12-20 05:47:30 UTC ---

(In reply to comment #9)



And some performance measurements (with the buffer resident in L1 cache):



> $ arm-none-eabi-gcc-4.7.2 -O2 -mcpu=cortex-a8 -c test.c
> $ objdump -d test.o
>
> 00000000 <fill>:
>    0:    e2511010     subs    r1, r1, #16
>    4:    412fff1e     bxmi    lr
>    8:    e2511010     subs    r1, r1, #16
>    c:    e1c020f0     strd    r2, [r0]
>   10:    e1c020f8     strd    r2, [r0, #8]
>   14:    e2800010     add     r0, r0, #16
>   18:    5afffffa     bpl     8 <fill+0x8>
>   1c:    e12fff1e     bx      lr

Cortex-A8  - 5   cycles per iteration
Cortex-A9  - 4.5 cycles per iteration
Cortex-A15 - 3   cycles per iteration
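
test.c itself is not quoted in this comment. Judging only from the disassembly
above (r0 used as the buffer pointer, r1 as a byte counter decremented by 16,
r2:r3 holding a 64-bit value), the source is presumably something close to the
sketch below. The parameter names and the exact loop form are assumptions, not
the original file.

```c
#include <stdint.h>

/*
 * Assumed reconstruction of fill(): store the 64-bit value v into buf,
 * 16 bytes (two doublewords) per iteration, for as long as at least
 * 16 bytes of the byte count n remain.
 */
void fill(uint64_t *buf, int n, uint64_t v)
{
    while ((n -= 16) >= 0) {
        *buf++ = v;   /* first strd of the iteration */
        *buf++ = v;   /* second strd of the iteration */
    }
}
```

Both stores are written as post-increment accesses, so a loop built on
post-indexed strd is a natural translation of this source.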



> $ arm-none-eabi-gcc-4.8.0 -O2 -mcpu=cortex-a8 -c test.c
> $ objdump -d test.o
>
> 00000000 <fill>:
>    0:    e351000f     cmp     r1, #15
>    4:    d12fff1e     bxle    lr
>    8:    e2411010     sub     r1, r1, #16
>    c:    e280c010     add     ip, r0, #16
>   10:    e3c1100f     bic     r1, r1, #15
>   14:    e08c1001     add     r1, ip, r1
>   18:    e1c020f0     strd    r2, [r0]
>   1c:    e2800010     add     r0, r0, #16
>   20:    e14020f8     strd    r2, [r0, #-8]
>   24:    e1500001     cmp     r0, r1
>   28:    1afffffa     bne     18 <fill+0x18>
>   2c:    e12fff1e     bx      lr

Cortex-A8  - 6 cycles per iteration
Cortex-A9  - 4 cycles per iteration
Cortex-A15 - 3 cycles per iteration



Instead, we could have expected something like the following code for the
inner loop, using post-indexed stores:



1:      strd    V, [BUF], #8
        subs    N, N, #16
        strd    V, [BUF], #8
        bpl     1b

Cortex-A8  - 4   cycles per iteration
Cortex-A9  - 4   cycles per iteration
Cortex-A15 - 2.5 cycles per iteration
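
To put the three variants on a common scale: every iteration executes two strd
instructions and so stores 16 bytes. The small helper below (a hypothetical
name, not part of the test case) converts a cycles-per-iteration figure into
bytes stored per cycle.

```c
/* Each loop iteration performs two 8-byte strd stores. */
#define BYTES_PER_ITERATION 16.0

/* Convert a measured cycles-per-iteration figure into bytes per cycle. */
double bytes_per_cycle(double cycles_per_iteration)
{
    return BYTES_PER_ITERATION / cycles_per_iteration;
}
```

For example, 4 cycles per iteration corresponds to 4 bytes per cycle, while
5 cycles per iteration corresponds to 3.2 bytes per cycle.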
