http://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725
Richard Earnshaw <rearnsha at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2010-05-11 07:35:23         |2010-09-29 7:35:23
               date|                            |
                 CC|                            |rearnsha at gcc dot gnu.org

--- Comment #1 from Richard Earnshaw <rearnsha at gcc dot gnu.org> 2010-09-29 16:28:17 UTC ---
So the compiler is correct not to be using vld1 for this code. The memory
format of int32x4_t is defined to be the format of a neon register that has
been filled from an array of int32 values and then stored to memory using VSTM
(or equivalent sequence).

The implication of all this is that int32x4_t does not (necessarily) have the
same memory layout as int32_t[4].

arm_neon.h provides intrinsics for filling neon registers from arrays in
memory, and in this case I think you should be using these directly. That is,
your macro should be modified to contain:

    #define X(n) {int32x4_t v;                      \
        v = vld1q_s32((const int32_t*)&p[n]);       \
        v = vaddq_s32(v, a);                        \
        v = vorrq_s32(v, b);                        \
        vst1q_s32 ((int32_t*)&p[n], v);}

There are still problems after doing this, however. In particular the compiler
is not correctly tracking alias information for the load/store intrinsics,
which means it is unable to move stores past loads to reduce stalls in the
pipeline.

The stack wastage appears to be fixed in trunk gcc; at least I don't see any
stack allocation for your testcase.

I haven't looked into the scheduling issues at this time.
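
For anyone wanting to try the load/operate/store pattern from the macro outside the original testcase, here is a minimal standalone sketch. It is not the reporter's code: the function name add_or_block, the choice to pass a and b as int32_t arrays rather than int32x4_t values, and the scalar fallback for non-NEON targets are all assumptions made for illustration. The NEON path uses only vld1q_s32, vaddq_s32, vorrq_s32 and vst1q_s32 from arm_neon.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper (name and signature are illustrative only):
   computes p[i] = (p[i] + a[i]) | b[i] for four int32_t lanes. */
static void add_or_block(int32_t *p, const int32_t a[4], const int32_t b[4])
{
    assert(p != NULL && a != NULL && b != NULL);
#if HAVE_NEON
    /* Fill registers from plain arrays with vld1q instead of
       dereferencing an int32x4_t pointer. */
    int32x4_t va = vld1q_s32(a);
    int32x4_t vb = vld1q_s32(b);
    int32x4_t v  = vld1q_s32(p);
    v = vaddq_s32(v, va);
    v = vorrq_s32(v, vb);
    vst1q_s32(p, v);   /* store back in array (int32_t[4]) order */
#else
    /* Scalar fallback so the sketch also builds on non-NEON targets.
       The add goes through uint32_t so it wraps like vaddq_s32. */
    for (int i = 0; i < 4; i++) {
        uint32_t sum = (uint32_t)p[i] + (uint32_t)a[i];
        p[i] = (int32_t)(sum | (uint32_t)b[i]);
    }
#endif
}
```

Called on a 4-element int32_t array, this updates the array in place. The data is only ever accessed as int32_t[4] through the vld1q/vst1q intrinsics, so the sketch does not depend on the in-memory layout of int32x4_t.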