https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89101
--- Comment #2 from Gael Guennebaud <gael.guennebaud at gmail dot com> ---
Indeed, it fails to remove the dup only if the coefficient is used multiple
times, as in the following reduced example (https://godbolt.org/z/hmSaE0):

#include <arm_neon.h>

void foo(const float* a, const float * b, float * c, int n) {
  float32x4_t c0, c1, c2, c3;
  c0 = vld1q_f32(c+0*4);
  c1 = vld1q_f32(c+1*4);
  for(int k=0; k<n; k++)
  {
    float32x4_t a0 = vld1q_f32(a+0*4+k*4);
    float32x4_t b0 = vld1q_f32(b+k*4);
    c0 = vfmaq_laneq_f32(c0, a0, b0, 0);
    c1 = vfmaq_laneq_f32(c1, a0, b0, 0);
  }
  vst1q_f32(c+0*4, c0);
  vst1q_f32(c+1*4, c1);
}

I tested with gcc 7 and 8.