http://gcc.gnu.org/bugzilla/show_bug.cgi?id=51980
mgretton at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |mgretton at gcc dot gnu.org

--- Comment #7 from mgretton at gcc dot gnu.org ---
Testing the testcase in #4 with a recent trunk (gcc version 4.9.0 20130528
(experimental) (GCC)) gives the following results:

arm-none-eabi-gcc -march=armv7-a -mfpu=neon -mfloat-abi=softfp -O2 -mthumb:

sqrlen4D_16u8:
    vmov    d18, r0, r1  @ v16qi
    vmov    d19, r2, r3
    vld1.64 {d16-d17}, [sp:64]
    vabd.u8 q8, q9, q8
    vmull.u8    q9, d16, d16
    vmull.u8    q8, d17, d17
    vuzp.32 q9, q8
    vpaddl.u16  q9, q9
    vmov    q10, q9  @ v4si
    vpadal.u16  q10, q8
    vmov    r0, r1, d20  @ v4si
    vmov    r2, r3, d21
    bx  lr

arm-none-eabi-gcc -march=armv7-a -mfpu=neon -mfloat-abi=hard -O2 -mthumb:

sqrlen4D_16u8:
    vabd.u8 q1, q0, q1
    vmull.u8    q0, d2, d2
    vmull.u8    q8, d3, d3
    vuzp.32 q0, q8
    vpaddl.u16  q0, q0
    vpadal.u16  q0, q8
    bx  lr

So code generation seems to be OK for the hard-float ABI, but the softfp
version still has an issue: an extra vmov (q10 from q9) between the vpaddl
and the vpadal.
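
For readers following the listings, here is a scalar C sketch of what the instruction sequence appears to compute. It is inferred from the assembly only, not copied from the testcase in #4, and the name sqrlen4d_16u8_ref and its array-based signature are hypothetical. Reading the two inputs as four 4-element u8 vectors each, every 32-bit result lane holds the squared length of one difference vector.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar reference, inferred from the assembly above.
 * a, b: 16 unsigned bytes each, viewed as four 4D vectors.
 * out:  four 32-bit sums, one per 4D vector. */
void sqrlen4d_16u8_ref(const uint8_t a[16], const uint8_t b[16],
                       uint32_t out[4])
{
    assert(a && b && out);

    for (int lane = 0; lane < 4; lane++) {
        uint32_t sum = 0;
        for (int j = 0; j < 4; j++) {
            int i = 4 * lane + j;
            /* vabd.u8: absolute difference of unsigned bytes */
            uint8_t d = (uint8_t)(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
            /* vmull.u8: widening square, at most 255 * 255 = 65025 */
            sum += (uint32_t)d * d;
        }
        /* vuzp.32 + vpaddl.u16 + vpadal.u16: the four squares that
         * belong to one 4D vector end up summed in one 32-bit lane */
        out[lane] = sum;
    }
}
```

Comparing the output of this loop against both compiled variants is one way to confirm that the extra vmov in the softfp code only costs an instruction and does not change the result.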