https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89101
Wilco <wilco at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilco at gcc dot gnu.org

--- Comment #1 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Gael Guennebaud from comment #0)
> vfmaq_laneq_f32 is currently implemented as:
>
> __extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
> vfmaq_laneq_f32 (float32x4_t __a, float32x4_t __b,
>                  float32x4_t __c, const int __lane)
> {
>   return __builtin_aarch64_fmav4sf (__b,
>                                     __aarch64_vdupq_laneq_f32 (__c, __lane),
>                                     __a);
> }
>
> thus leading to unoptimized code such as:
>
>         ldr     q1, [x2, 16]
>         dup     v28.4s, v1.s[0]
>         dup     v27.4s, v1.s[1]
>         dup     v26.4s, v1.s[2]
>         dup     v1.4s,  v1.s[3]
>         fmla    v22.4s, v25.4s, v28.4s
>         fmla    v3.4s,  v25.4s, v27.4s
>         fmla    v6.4s,  v25.4s, v26.4s
>         fmla    v17.4s, v25.4s, v1.4s
>
> instead of:
>
>         ldr     q1, [x2, 16]
>         fmla    v22.4s, v25.4s, v1.s[0]
>         fmla    v3.4s,  v25.4s, v1.s[1]
>         fmla    v6.4s,  v25.4s, v1.s[2]
>         fmla    v17.4s, v25.4s, v1.s[3]
>
> I guess several other *lane* intrinsics exhibit the same shortcoming.

Which compiler version did you use? I tried this on GCC 6, 7, 8, and 9 with -O2:

#include "arm_neon.h"

float32x4_t f(float32x4_t a, float32x4_t b, float32x4_t c)
{
  a = vfmaq_laneq_f32 (a, b, c, 0);
  a = vfmaq_laneq_f32 (a, b, c, 1);
  return a;
}

which compiles to:

        fmla    v0.4s, v1.4s, v2.4s[0]
        fmla    v0.4s, v1.4s, v2.4s[1]
        ret

In all cases the optimizer is able to merge the dups as expected. If it still
fails for you, could you provide a compilable example like the one above that
shows the issue?
> For the record, I managed to partly work around this issue by writing my own
> version as:
>
> if(LaneID==0)      asm("fmla %0.4s, %1.4s, %2.s[0]\n" : "+w" (c) : "w" (a), "w" (b) : );
> else if(LaneID==1) asm("fmla %0.4s, %1.4s, %2.s[1]\n" : "+w" (c) : "w" (a), "w" (b) : );
> else if(LaneID==2) asm("fmla %0.4s, %1.4s, %2.s[2]\n" : "+w" (c) : "w" (a), "w" (b) : );
> else if(LaneID==3) asm("fmla %0.4s, %1.4s, %2.s[3]\n" : "+w" (c) : "w" (a), "w" (b) : );
>
> but that's of course not ideal. This change yields a 32% speedup in Eigen's
> matrix product: http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1633

I'd strongly advise against using inline assembler: most people make mistakes
writing it, and GCC cannot optimize code that uses inline assembler.