https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92075

            Bug ID: 92075
           Summary: extracting element from NEON float-vector moves
                    to/from integer register
           Product: gcc
           Version: 9.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: matthijsvanduin at gmail dot com
  Target Milestone: ---

On ARM, when extracting an element from a float32x2_t expression, i.e.:

   float32x2_t v = (...);
   float v0 = v[0];

In most cases, gcc moves the element to an general-purpose register and back to
a VFP/Neon register, even at -Ofast. It doesn't seem to happen when v is an
argument or the return value of a non-inline function, but it does happen e.g.
when v is an arithmetic expression or produced by a NEON intrinsic or inline
asm. It doesn't seem to matter how v0 is consumed (i.e. by returning it,
passing it as argument to a function, or consuming it by inline asm).

Some test-cases:

#include <arm_neon.h>

float test1( float32x2_t v ) {
        return (v + v)[0];
}

void test2() {
        float32x2_t v;
        asm( "" : "=w"(v) );
        float v0 = v[0];
        asm( "" :: "w"(v0) );
}

void foo( float );
void test3( uint32x2_t v ) {
        foo( vcvt_n_f32_u32( v, 32 )[0] );
}

output produced by "arm-linux-gnueabihf-gcc-9 (Debian 9.2.1-8) 9.2.1 20190909"
with -Ofast -mcpu=cortex-a8 -mfpu=neon, reformatted for readability:

test1:
        vadd.f32  d0, d0, d0
        vmov.32   r3, d0[0]
        vmov      s0, r3
        bx        lr
test2:
        vmov.32  r3, d16[0]
        vmov     s15, r3
        bx       lr
test3:
        vcvt.f32.u32  d0, d0, #32
        vmov.32       r3, d0[0]
        vmov          s0, r3
        b             foo(PLT)

This is especially bad on the cortex-A8, where moving from a VFP/Neon register
to an general purpose register causes a severe pipeline stall.

Note btw how in test1 and test3 no move is needed at all: the final move
destination is the register it originally came from, and a different choice of
register allocation can make this also true in test2.

Reply via email to