https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82989
--- Comment #13 from Matthijs van Duin <matthijsvanduin at gmail dot com> ---
In case it's of interest, I did a quick benchmark of my testcase executed in a
loop on a Cortex-A8:
Without NEON:
12 instructions/iteration
14 cycles/iteration
With NEON:
14 instructions/iteration
35.2-35.3 cycles/iteration
(Both instruction counts include 4 instructions for the loop itself.)
When using NEON, the majority of the time is spent in a nasty pipeline stall
for moving data from NEON registers to ARM registers, which takes a minimum of
20 cycles according to the Cortex-A8 TRM.
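
As a rough cross-check of the numbers above, the arithmetic can be sketched in
C. The constants are the figures quoted in this comment. The function names are
made up for illustration and are not part of the benchmark. Attributing the
whole difference to the NEON-to-ARM transfer is an assumption of this sketch,
not something that was measured separately.

```c
#include <assert.h>
#include <stdio.h>

/* Figures quoted in the comment above (Cortex-A8, per loop iteration). */
#define CYCLES_NO_NEON   14.0
#define CYCLES_NEON_LOW  35.2
#define CYCLES_NEON_HIGH 35.3
/* Minimum NEON-to-ARM transfer cost, as cited above from the Cortex-A8 TRM. */
#define STALL_MIN_CYCLES 20.0

/* Extra cycles per iteration of the NEON build relative to the non-NEON build. */
double extra_cycles(double with_neon, double without_neon)
{
    return with_neon - without_neon;
}

/* Fraction of the NEON iteration time taken up by that extra cost. */
double extra_fraction(double with_neon, double without_neon)
{
    assert(with_neon > 0.0);
    return extra_cycles(with_neon, without_neon) / with_neon;
}

/* Hypothetical helper: print the comparison for the lower measured bound. */
void print_summary(void)
{
    printf("extra cycles/iteration: %.1f (stall minimum: %.1f)\n",
           extra_cycles(CYCLES_NEON_LOW, CYCLES_NO_NEON), STALL_MIN_CYCLES);
    printf("share of NEON iteration time: %.0f%%\n",
           100.0 * extra_fraction(CYCLES_NEON_LOW, CYCLES_NO_NEON));
}
```

With the lower bound of 35.2 cycles, the NEON version costs roughly 21 cycles
more per iteration, which is about 60% of its total time and in line with a
transfer stall of at least 20 cycles.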