On Wed, 2 Nov 2016, Martin Storsjö wrote:
This work is sponsored by, and copyright, Google.

These are ported from the ARM version; it is essentially a 1:1 port
with no extra added features, but with some hand tuning (especially
for the plain copy/avg functions). The ARM version isn't very register
starved to begin with, so there's not much to be gained from having
more spare registers here - we only avoid having to clobber
callee-saved registers.

Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                    ARM   AArch64
vp9_avg4_neon:                     32.2      23.7
vp9_avg8_neon:                     57.5      53.7
vp9_avg16_neon:                   168.6     165.4
vp9_avg32_neon:                   586.7     585.2
vp9_avg64_neon:                  2458.6    2325.9
vp9_avg_8tap_smooth_4h_neon:      130.7     124.0
vp9_avg_8tap_smooth_4hv_neon:     478.8     440.3
vp9_avg_8tap_smooth_4v_neon:      118.0      96.2
vp9_avg_8tap_smooth_8h_neon:      239.7     232.0
vp9_avg_8tap_smooth_8hv_neon:     691.3     649.9
vp9_avg_8tap_smooth_8v_neon:      238.0     214.5
vp9_avg_8tap_smooth_64h_neon:   11512.9   11492.8
vp9_avg_8tap_smooth_64hv_neon:  23322.1   23255.1
vp9_avg_8tap_smooth_64v_neon:   11556.2   11554.5
vp9_put4_neon:                     18.0      16.5
vp9_put8_neon:                     40.2      37.7
vp9_put16_neon:                    99.4      95.2
vp9_put32_neon:                   348.8     307.4
vp9_put64_neon:                  1321.3    1109.8
vp9_put_8tap_smooth_4h_neon:      124.7     117.3
vp9_put_8tap_smooth_4hv_neon:     465.8     425.3
vp9_put_8tap_smooth_4v_neon:      105.0      82.5
vp9_put_8tap_smooth_8h_neon:      227.7     218.2
vp9_put_8tap_smooth_8hv_neon:     661.4     620.1
vp9_put_8tap_smooth_8v_neon:      208.0     187.2
vp9_put_8tap_smooth_64h_neon:   10864.6   10873.9
vp9_put_8tap_smooth_64hv_neon:  21359.4   21295.7
vp9_put_8tap_smooth_64v_neon:    9629.1    9639.4

These are generally about as fast as the corresponding ARM
routines on the same CPU (at least on the A53), in most cases
marginally faster.

The speedup vs C code is pretty much the same as for the 32 bit
case; on the A53 it's around 6-13x for the larger 8tap filters.
The exact speedup varies a little, since the C versions generally
don't end up exactly as slow/fast as on 32 bit.
---
v2: Updated according to the comments on the 32 bit version.
---
 libavcodec/aarch64/Makefile              |   2 +
 libavcodec/aarch64/vp9dsp_init_aarch64.c | 139 ++++++
 libavcodec/aarch64/vp9mc_neon.S          | 733 +++++++++++++++++++++++++++++++
 libavcodec/vp9.h                         |   1 +
 libavcodec/vp9dsp.c                      |   2 +
 5 files changed, 877 insertions(+)
 create mode 100644 libavcodec/aarch64/vp9dsp_init_aarch64.c
 create mode 100644 libavcodec/aarch64/vp9mc_neon.S
+function ff_vp9_copy64_neon, export=1
+1:
+        ldp             x5,  x6,  [x2]
+        stp             x5,  x6,  [x0]
+        ldp             x5,  x6,  [x2, #16]
+        stp             x5,  x6,  [x0, #16]
+        subs            w4,  w4,  #1
+        ldp             x5,  x6,  [x2, #32]
+        stp             x5,  x6,  [x0, #32]
+        ldp             x5,  x6,  [x2, #48]
+        stp             x5,  x6,  [x0, #48]
+        add             x2,  x2,  x3
+        add             x0,  x0,  x1
+        b.ne            1b
+        ret
+endfunc
I forgot to mention it anywhere, but the copy32 and copy64 functions don't actually use any vector registers at all, but only plain aarch64 ldp/stp. When implemented with neon loads/stores, they ended up significantly slower than the C version, on my dragonboard.
Currently copy64 runs at around 1100 cycles, while a trivial neon version (that loads all 64 bytes at once with a ld1 {v0,v1,v2,v3}) runs at around 1600 cycles. One could of course play with all the different combinations of loading 16, 32 or 64 bytes per ld1 and scheduling them differently (IIRC I did try some of those combinations at least), but I never got down to what the C version did unless I used ldp/stp.
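For reference, here is roughly what the quoted loop does, written as a plain C sketch. The function name copy64_sketch is made up for illustration and is not part of the patch; the argument order is only inferred from the registers the assembly uses (x0 = dst, x1 = dst stride, x2 = src, x3 = src stride, w4 = height). Each 16-byte step stands in for one ldp/stp pair; whether a compiler actually emits ldp/stp for this is not something the sketch guarantees.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not from the patch: copies h rows of 64 bytes.
 * Argument order mirrors the registers used by ff_vp9_copy64_neon. */
static void copy64_sketch(uint8_t *dst, ptrdiff_t dst_stride,
                          const uint8_t *src, ptrdiff_t src_stride, int h)
{
    /* Like the assembly loop, this assumes h >= 1: the counter is
     * only tested after the first row has been copied. */
    do {
        /* Four steps of 16 bytes, each moving two 64-bit values,
         * corresponding to one ldp/stp pair in the assembly. */
        for (int i = 0; i < 64; i += 16) {
            uint64_t lo, hi;
            memcpy(&lo, src + i,     8);
            memcpy(&hi, src + i + 8, 8);
            memcpy(dst + i,     &lo, 8);
            memcpy(dst + i + 8, &hi, 8);
        }
        src += src_stride;
        dst += dst_stride;
    } while (--h);
}
```

The fixed-size memcpy calls are only there to keep the unaligned 64-bit accesses well defined in C; the point is that nothing in the routine needs vector registers.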
Technically, having a _neon suffix for them is wrong, but anything else (omitting these two while hooking up avg32/avg64 separately) is more complication - although I'm open to suggestions on how to handle it best.
// Martin
