http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59078
Bug ID: 59078
Summary: autoincrement feature of NEON store instructions is not used
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: tir5c3 at yahoo dot co.uk

The following testcase, when compiled with
'g++ test.cc -O3 -mfpu=neon --save-temps -c', produces very inefficient code.

=====
#include <arm_neon.h>

uint64_t* foo(uint64_t* x, uint32_t y)
{
    uint64x2_t d = vreinterpretq_u64_u32(vdupq_n_u32(y));
    vst1q_u64(x, d); x+=2;
    vst1q_u64(x, d); x+=2;
    vst1q_u64(x, d); x+=2;
    vst1q_u64(x, d); x+=2;
    vst1q_u64(x, d); x+=2;
    vst1q_u64(x, d); x+=2;
    vst1q_u64(x, d); x+=2;
    vst1q_u64(x, d); x+=2;
    return x;
}
=====

The resulting assembly:

=====
_Z3fooPyj:
    push {r4, r5, r6, r7}
    vdup.32 q8, r1
    add r7, r0, #32
    add r6, r0, #48
    add r5, r0, #64
    add r4, r0, #80
    add r1, r0, #96
    add r2, r0, #112
    mov r3, r0
    adds r0, r0, #128
    vst1.64 {d16-d17}, [r3:64]!
    vst1.64 {d16-d17}, [r3:64]
    vst1.64 {d16-d17}, [r7:64]
    vst1.64 {d16-d17}, [r6:64]
    vst1.64 {d16-d17}, [r5:64]
    vst1.64 {d16-d17}, [r4:64]
    vst1.64 {d16-d17}, [r1:64]
    vst1.64 {d16-d17}, [r2:64]
    pop {r4, r5, r6, r7}
    bx lr
=====

The main problem is that the pointer autoincrement feature of the vst1.64
instruction is not fully utilized. GCC apparently uses it for the first
store, but then falls back to computing a separate address register for
each of the remaining stores.

I would expect GCC to produce the following output:

=====
_Z3fooPyj:
    vdup.32 q8, r1
    vst1.64 {d16-d17}, [r0:64]!
    vst1.64 {d16-d17}, [r0:64]!
    vst1.64 {d16-d17}, [r0:64]!
    vst1.64 {d16-d17}, [r0:64]!
    vst1.64 {d16-d17}, [r0:64]!
    vst1.64 {d16-d17}, [r0:64]!
    vst1.64 {d16-d17}, [r0:64]!
    vst1.64 {d16-d17}, [r0:64]!
    bx lr
=====

On unrolled loops GCC spills almost all registers to memory, which makes
the code two to three times slower than the optimal version.

This bug was observed with GCC 4.8.1. This email [1] suggests that mainline
is also affected.

[1]: http://gcc.gnu.org/ml/gcc-help/2013-11/msg00075.html
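
To check whether a given compiler build folds the pointer increments into the stores, the same store pattern can also be written as a loop. The following is only a sketch and is not part of the original testcase: the name fill_blocks, the blocks parameter, and the scalar fallback for hosts without NEON are assumptions added for illustration.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper (not from the original report): writes `blocks`
// 16-byte blocks, each filled with the 32-bit value `y` repeated four
// times, and returns the pointer advanced past the last block written.
uint64_t* fill_blocks(uint64_t* x, uint32_t y, size_t blocks)
{
    assert(x != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // Same intrinsics as the testcase: one vst1q_u64 per block followed
    // by a pointer bump that the compiler could fold into the store.
    const uint64x2_t d = vreinterpretq_u64_u32(vdupq_n_u32(y));
    for (size_t i = 0; i < blocks; ++i) {
        vst1q_u64(x, d);
        x += 2;
    }
#else
    // Scalar fallback (an assumption, only so the sketch builds on
    // non-NEON hosts): each 64-bit lane holds `y` in both halves.
    const uint64_t lane = ((uint64_t)y << 32) | y;
    for (size_t i = 0; i < blocks; ++i) {
        x[0] = lane;
        x[1] = lane;
        x += 2;
    }
#endif
    return x;
}
```

Calling fill_blocks(x, y, 8) writes the same 128 bytes as foo above. Compiling it with the flags from the report and inspecting the output of --save-temps shows whether each vst1.64 carries the `!` writeback suffix.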