http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59078

            Bug ID: 59078
           Summary: autoincrement feature of NEON store instructions is
                    not used
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tir5c3 at yahoo dot co.uk

The following testcase, when compiled with 'g++ test.cc -O3 -mfpu=neon
--save-temps -c', produces very inefficient code.

=====

#include <arm_neon.h>

uint64_t* foo(uint64_t* x, uint32_t y)
{
     uint64x2_t d = vreinterpretq_u64_u32(vdupq_n_u32(y));
     vst1q_u64(x, d);
     x+=2;
     vst1q_u64(x, d);
     x+=2;
     vst1q_u64(x, d);
     x+=2;
     vst1q_u64(x, d);
     x+=2;
     vst1q_u64(x, d);
     x+=2;
     vst1q_u64(x, d);
     x+=2;
     vst1q_u64(x, d);
     x+=2;
     vst1q_u64(x, d);
     x+=2;
     return x;
}

====

The resulting assembly:

====
_Z3fooPyj:
    push    {r4, r5, r6, r7}
    vdup.32    q8, r1
    add    r7, r0, #32
    add    r6, r0, #48
    add    r5, r0, #64
    add    r4, r0, #80
    add    r1, r0, #96
    add    r2, r0, #112
    mov    r3, r0
    adds    r0, r0, #128
    vst1.64    {d16-d17}, [r3:64]!
    vst1.64    {d16-d17}, [r3:64]
    vst1.64    {d16-d17}, [r7:64]
    vst1.64    {d16-d17}, [r6:64]
    vst1.64    {d16-d17}, [r5:64]
    vst1.64    {d16-d17}, [r4:64]
    vst1.64    {d16-d17}, [r1:64]
    vst1.64    {d16-d17}, [r2:64]
    pop    {r4, r5, r6, r7}
    bx    lr
====

The main problem is that the pointer autoincrement (writeback) feature of the
vst1.64 instruction is not fully used. GCC uses it for the first store only;
for the remaining stores it computes each address into a separate register,
which also forces r4-r7 to be saved and restored. I would expect GCC to
produce the following output:

====
_Z3fooPyj:
    vdup.32    q8, r1
    vst1.64    {d16-d17}, [r0:64]!
    vst1.64    {d16-d17}, [r0:64]!
    vst1.64    {d16-d17}, [r0:64]!
    vst1.64    {d16-d17}, [r0:64]!
    vst1.64    {d16-d17}, [r0:64]!
    vst1.64    {d16-d17}, [r0:64]!
    vst1.64    {d16-d17}, [r0:64]!
    vst1.64    {d16-d17}, [r0:64]!
    bx    lr
====
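
For reference, here is a minimal portable sketch of what the testcase stores,
useful for checking that any generated code is still correct. The name
foo_reference is hypothetical and not part of the report; the sketch assumes
only what the intrinsics document: vdupq_n_u32(y) replicates y into every
32-bit lane, and each vst1q_u64 writes two 64-bit lanes (16 bytes). It also
shows why one post-incremented base register is enough: every store goes to
the address where the previous one ended.

```cpp
#include <cstdint>

// Scalar model of foo() from the testcase (hypothetical helper, for checking only).
uint64_t* foo_reference(uint64_t* x, uint32_t y)
{
    // One 64-bit lane of the q register: y replicated into both 32-bit halves.
    const uint64_t lane = (static_cast<uint64_t>(y) << 32) | y;

    for (int store = 0; store < 8; ++store) {
        x[0] = lane;  // first 64-bit lane  (d16)
        x[1] = lane;  // second 64-bit lane (d17)
        x += 2;       // advance 16 bytes, as the "!" writeback would
    }
    return x;         // original pointer + 16 elements (128 bytes)
}
```

Comparing the buffer written by foo() with the one written by foo_reference()
on the target should give identical contents and the same returned pointer.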

In unrolled loops, GCC spills almost all registers to memory, which makes
the code two to three times slower than the optimal version.

This bug was observed with GCC 4.8.1. The email at [1] suggests that mainline
is also affected.

[1]: http://gcc.gnu.org/ml/gcc-help/2013-11/msg00075.html
