http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725

Richard Earnshaw <rearnsha at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2010-05-11 07:35:23         |2010-09-29 07:35:23
               date|                            |
                 CC|                            |rearnsha at gcc dot gnu.org

--- Comment #1 from Richard Earnshaw <rearnsha at gcc dot gnu.org> 2010-09-29 16:28:17 UTC ---
So the compiler is correct not to be using vld1 for this code.  The memory
format of int32x4_t is defined to be the format of a neon register that has
been filled from an array of int32 values and then stored to memory using VSTM
(or equivalent sequence).  The implication of all this is that int32x4_t does
not (necessarily) have the same memory layout as int32_t[4].


arm_neon.h provides intrinsics for filling neon registers from arrays in
memory, and in this case I think you should be using these directly.  That is,
your macro should be modified to contain:

#define X(n) {int32x4_t v; v = vld1q_s32((const int32_t*)&p[n]); \
  v = vaddq_s32(v, a); v = vorrq_s32(v, b); vst1q_s32((int32_t*)&p[n], v);}
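
For illustration only, here is roughly the same load/add/or/store pattern written as a function rather than a macro.  The original testcase is not quoted in this comment, so this is a sketch under assumptions: p is taken to be plain int32_t storage holding n groups of four lanes, a and b are taken to be int32_t[4] arrays, and the names add_or_blocks and HAVE_NEON are hypothetical.  The scalar branch is there only so the sketch also builds on a host without NEON; it is not part of the suggestion above.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical feature macro for this sketch */
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: for each of the n four-lane groups in p,
   compute (p + a) | b lane by lane and write the result back. */
static void add_or_blocks(int32_t *p, int n,
                          const int32_t a[4], const int32_t b[4])
{
#if HAVE_NEON
    /* Fill the registers with the array-load intrinsic instead of
       dereferencing an int32x4_t pointer. */
    int32x4_t va = vld1q_s32(a);
    int32x4_t vb = vld1q_s32(b);
    for (int i = 0; i < n; i++) {
        int32x4_t v = vld1q_s32(p + 4 * i);  /* load in array element order */
        v = vaddq_s32(v, va);
        v = vorrq_s32(v, vb);
        vst1q_s32(p + 4 * i, v);             /* store in array element order */
    }
#else
    /* Scalar fallback with the same lane-wise result. */
    for (int i = 0; i < n; i++)
        for (int lane = 0; lane < 4; lane++)
            p[4 * i + lane] = (p[4 * i + lane] + a[lane]) | b[lane];
#endif
}
```

Because the data only ever moves between memory and registers through vld1q_s32 and vst1q_s32, the code never depends on the in-memory format of int32x4_t itself.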


There are still problems after doing this, however.  In particular the compiler
is not correctly tracking alias information for the load/store intrinsics,
which means it is unable to move stores past loads to reduce stalls in the
pipeline.

The stack wastage appears to be fixed in trunk gcc; at least I don't see any
stack allocation for your testcase.

I haven't looked into the scheduling issues at this time.
