Re: [libav-devel] [PATCHv2] aarch64: vp9: Add NEON optimizations of VP9 MC functions

Martin Storsjö Wed, 02 Nov 2016 06:23:34 -0700

On Wed, 2 Nov 2016, Martin Storsjö wrote:

This work is sponsored by, and copyright, Google.


These are ported from the ARM version; it is essentially a 1:1
port with no extra added features, but with some hand tuning
(especially for the plain copy/avg functions). The ARM version
isn't very register starved to begin with, so there's not much
to be gained from having more spare registers here - we only
avoid having to clobber callee-saved registers.

Examples of runtimes vs the 32 bit version, on a Cortex A53:
                                    ARM   AArch64
vp9_avg4_neon:                      32.2      23.7
vp9_avg8_neon:                      57.5      53.7
vp9_avg16_neon:                    168.6     165.4
vp9_avg32_neon:                    586.7     585.2
vp9_avg64_neon:                   2458.6    2325.9
vp9_avg_8tap_smooth_4h_neon:       130.7     124.0
vp9_avg_8tap_smooth_4hv_neon:      478.8     440.3
vp9_avg_8tap_smooth_4v_neon:       118.0      96.2
vp9_avg_8tap_smooth_8h_neon:       239.7     232.0
vp9_avg_8tap_smooth_8hv_neon:      691.3     649.9
vp9_avg_8tap_smooth_8v_neon:       238.0     214.5
vp9_avg_8tap_smooth_64h_neon:    11512.9   11492.8
vp9_avg_8tap_smooth_64hv_neon:   23322.1   23255.1
vp9_avg_8tap_smooth_64v_neon:    11556.2   11554.5
vp9_put4_neon:                      18.0      16.5
vp9_put8_neon:                      40.2      37.7
vp9_put16_neon:                     99.4      95.2
vp9_put32_neon:                    348.8     307.4
vp9_put64_neon:                   1321.3    1109.8
vp9_put_8tap_smooth_4h_neon:       124.7     117.3
vp9_put_8tap_smooth_4hv_neon:      465.8     425.3
vp9_put_8tap_smooth_4v_neon:       105.0      82.5
vp9_put_8tap_smooth_8h_neon:       227.7     218.2
vp9_put_8tap_smooth_8hv_neon:      661.4     620.1
vp9_put_8tap_smooth_8v_neon:       208.0     187.2
vp9_put_8tap_smooth_64h_neon:    10864.6   10873.9
vp9_put_8tap_smooth_64hv_neon:   21359.4   21295.7
vp9_put_8tap_smooth_64v_neon:     9629.1    9639.4

These are generally about as fast as the corresponding ARM
routines on the same CPU (at least on the A53), in most cases
marginally faster.
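
To put a number on "marginally faster", here is a small sketch that computes the ARM/AArch64 ratio for three rows copied from the table above. The struct, the function name and the choice of rows are only for illustration and are not part of the patch.

```c
/* One row of the benchmark table above (runtimes as reported there). */
struct bench_row {
    const char *name;
    double arm;      /* 32 bit ARM runtime */
    double aarch64;  /* AArch64 runtime */
};

/* Three rows copied from the table; the selection is arbitrary. */
static const struct bench_row rows[] = {
    { "vp9_avg4_neon",                   32.2,    23.7 },
    { "vp9_put64_neon",                1321.3,  1109.8 },
    { "vp9_put_8tap_smooth_64h_neon", 10864.6, 10873.9 },
};

/* A ratio above 1.0 means the AArch64 routine is faster than the ARM one. */
static double arm_over_aarch64(const struct bench_row *r)
{
    return r->arm / r->aarch64;
}
```

For these rows the ratio comes out at roughly 1.36, 1.19 and just under 1.0: the smallest and the plain copy cases gain the most, while the large 8tap case is a wash.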

The speedup vs C code is pretty much the same as for the 32 bit
case; on the A53 it's around 6-13x for the larger 8tap filters.
The exact speedup varies a little, since the C versions generally
don't end up exactly as slow/fast as on 32 bit.
---
v2: Updated according to the comments on the 32 bit version.
---
libavcodec/aarch64/Makefile              |   2 +
libavcodec/aarch64/vp9dsp_init_aarch64.c | 139 ++++++
libavcodec/aarch64/vp9mc_neon.S          | 733 +++++++++++++++++++++++++++++++
libavcodec/vp9.h                         |   1 +
libavcodec/vp9dsp.c                      |   2 +
5 files changed, 877 insertions(+)
create mode 100644 libavcodec/aarch64/vp9dsp_init_aarch64.c
create mode 100644 libavcodec/aarch64/vp9mc_neon.S

+function ff_vp9_copy64_neon, export=1
+1:
+        ldp             x5,  x6,  [x2]
+        stp             x5,  x6,  [x0]
+        ldp             x5,  x6,  [x2, #16]
+        stp             x5,  x6,  [x0, #16]
+        subs            w4,  w4,  #1
+        ldp             x5,  x6,  [x2, #32]
+        stp             x5,  x6,  [x0, #32]
+        ldp             x5,  x6,  [x2, #48]
+        stp             x5,  x6,  [x0, #48]
+        add             x2,  x2,  x3
+        add             x0,  x0,  x1
+        b.ne            1b
+        ret
+endfunc

I forgot to mention it anywhere, but the copy32 and copy64 functions don't actually use any vector registers at all, but only plain aarch64 ldp/stp. When implemented with neon loads/stores, they ended up significantly slower than the C version, on my dragonboard.

Currently copy64 runs at around 1100 cycles, while a trivial neon version (that loads all 64 bytes at once with a ld1 {v0,v1,v2,v3}) runs at around 1600 cycles. One could of course play with all different combinations of loading 16, 32 or 64 bytes per ld1 and scheduling them differently (IIRC I did try some of those combinations at least), but I never got down to what the C version did unless I used ldp/stp.
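
For reference, a rough C-level picture of what the quoted ldp/stp loop does. This is only a sketch, not part of the patch: `copy64_ref` is a hypothetical name, and the argument order is simply read off the registers used in the assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C reference for the quoted loop; not part of the patch.
 * Argument order follows the registers used in the assembly:
 * x0 = dst, x1 = dst_stride, x2 = src, x3 = src_stride, w4 = h. */
static void copy64_ref(uint8_t *dst, ptrdiff_t dst_stride,
                       const uint8_t *src, ptrdiff_t src_stride, int h)
{
    /* The assembly decrements before testing, so it assumes h >= 1. */
    assert(h > 0);
    do {
        /* Four 16 byte chunks per row, one per ldp/stp pair. */
        for (int off = 0; off < 64; off += 16) {
            uint64_t lo, hi;
            memcpy(&lo, src + off,     8);
            memcpy(&hi, src + off + 8, 8);
            memcpy(dst + off,     &lo, 8);
            memcpy(dst + off + 8, &hi, 8);
        }
        src += src_stride;
        dst += dst_stride;
    } while (--h);
}
```

Each row moves through two 64 bit general purpose values at a time, which is the same shape as the assembly: no vector registers are involved anywhere.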

Technically, having a _neon suffix for them is wrong, but anything else (omitting these two while hooking up avg32/avg64 separately) is more complication - although I'm open to suggestions on how to handle it best.


// Martin
