https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92075
Bug ID: 92075 Summary: extracting element from NEON float-vector moves to/from integer register Product: gcc Version: 9.2.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: matthijsvanduin at gmail dot com Target Milestone: --- On ARM, when extracting an element from a float32x2_t expression, i.e.: float32x2_t v = (...); float v0 = v[0]; In most cases, gcc moves the element to an general-purpose register and back to a VFP/Neon register, even at -Ofast. It doesn't seem to happen when v is an argument or the return value of a non-inline function, but it does happen e.g. when v is an arithmetic expression or produced by a NEON intrinsic or inline asm. It doesn't seem to matter how v0 is consumed (i.e. by returning it, passing it as argument to a function, or consuming it by inline asm). Some test-cases: #include <arm_neon.h> float test1( float32x2_t v ) { return (v + v)[0]; } void test2() { float32x2_t v; asm( "" : "=w"(v) ); float v0 = v[0]; asm( "" :: "w"(v0) ); } void foo( float ); void test3( uint32x2_t v ) { foo( vcvt_n_f32_u32( v, 32 )[0] ); } output produced by "arm-linux-gnueabihf-gcc-9 (Debian 9.2.1-8) 9.2.1 20190909" with -Ofast -mcpu=cortex-a8 -mfpu=neon, reformatted for readability: test1: vadd.f32 d0, d0, d0 vmov.32 r3, d0[0] vmov s0, r3 bx lr test2: vmov.32 r3, d16[0] vmov s15, r3 bx lr test3: vcvt.f32.u32 d0, d0, #32 vmov.32 r3, d0[0] vmov s0, r3 b foo(PLT) This is especially bad on the cortex-A8, where moving from a VFP/Neon register to an general purpose register causes a severe pipeline stall. Note btw how in test1 and test3 no move is needed at all: the final move destination is the register it originally came from, and a different choice of register allocation can make this also true in test2.