https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89101
Wilco <wilco at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilco at gcc dot gnu.org
--- Comment #1 from Wilco <wilco at gcc dot gnu.org> ---
(In reply to Gael Guennebaud from comment #0)
> vfmaq_laneq_f32 is currently implemented as:
>
> __extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
> vfmaq_laneq_f32 (float32x4_t __a, float32x4_t __b,
> float32x4_t __c, const int __lane)
> {
> return __builtin_aarch64_fmav4sf (__b,
> __aarch64_vdupq_laneq_f32 (__c, __lane),
> __a);
> }
>
> thus leading to unoptimized code such as:
>
> ldr q1, [x2, 16]
> dup v28.4s, v1.s[0]
> dup v27.4s, v1.s[1]
> dup v26.4s, v1.s[2]
> dup v1.4s, v1.s[3]
> fmla v22.4s, v25.4s, v28.4s
> fmla v3.4s, v25.4s, v27.4s
> fmla v6.4s, v25.4s, v26.4s
> fmla v17.4s, v25.4s, v1.4s
>
> instead of:
>
> ldr q1, [x2, 16]
> fmla v22.4s, v25.4s, v1.s[0]
> fmla v3.4s, v25.4s, v1.s[1]
> fmla v6.4s, v25.4s, v1.s[2]
> fmla v17.4s, v25.4s, v1.s[3]
>
> I guess several other *lane* intrinsics exhibit the same shortcoming.
Which compiler version did you use? I tried this on GCC 6, 7, 8, and 9 with -O2:
#include "arm_neon.h"
float32x4_t f(float32x4_t a, float32x4_t b, float32x4_t c)
{
a = vfmaq_laneq_f32 (a, b, c, 0);
a = vfmaq_laneq_f32 (a, b, c, 1);
return a;
}
which in every case compiles to:
fmla v0.4s, v1.4s, v2.4s[0]
fmla v0.4s, v1.4s, v2.4s[1]
ret
In all cases the optimizer is able to merge the dups as expected.
If it still fails for you, could you provide a compilable example like above
that shows the issue?
> For the record, I managed to partly workaround this issue by writing my own
> version as:
>
> if(LaneID==0)      asm("fmla %0.4s, %1.4s, %2.s[0]\n" : "+w" (c) : "w" (a), "w" (b) : );
> else if(LaneID==1) asm("fmla %0.4s, %1.4s, %2.s[1]\n" : "+w" (c) : "w" (a), "w" (b) : );
> else if(LaneID==2) asm("fmla %0.4s, %1.4s, %2.s[2]\n" : "+w" (c) : "w" (a), "w" (b) : );
> else if(LaneID==3) asm("fmla %0.4s, %1.4s, %2.s[3]\n" : "+w" (c) : "w" (a), "w" (b) : );
>
> but that's of course not ideal. This change yields a 32% speed up in Eigen's
> matrix product: http://eigen.tuxfamily.org/bz/show_bug.cgi?id=1633
I'd strongly advise against using inline assembler: most people make mistakes
writing it, and GCC cannot optimize code that contains inline assembler.