On Wed, 30 Mar 2022, Martin Storsjö wrote:
On Fri, 25 Mar 2022, Ben Avison wrote:
checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows.
vc1dsp.vc1_inv_trans_4x4_c: 158.2
vc1dsp.vc1_inv_trans_4x4_neon: 65.7
vc1dsp.vc1_inv_trans_4x4_dc_c: 86.5
vc1dsp.vc1_inv_trans_4x4_dc_neon: 26.5
vc1dsp.vc1_inv_trans_4x8_c: 335.2
vc1dsp.vc1_inv_trans_4x8_neon: 106.2
vc1dsp.vc1_inv_trans_4x8_dc_c: 151.2
vc1dsp.vc1_inv_trans_4x8_dc_neon: 25.5
vc1dsp.vc1_inv_trans_8x4_c: 365.7
vc1dsp.vc1_inv_trans_8x4_neon: 97.2
vc1dsp.vc1_inv_trans_8x4_dc_c: 139.7
vc1dsp.vc1_inv_trans_8x4_dc_neon: 16.5
vc1dsp.vc1_inv_trans_8x8_c: 547.7
vc1dsp.vc1_inv_trans_8x8_neon: 137.0
vc1dsp.vc1_inv_trans_8x8_dc_c: 268.2
vc1dsp.vc1_inv_trans_8x8_dc_neon: 30.5
Signed-off-by: Ben Avison <[email protected]>
---
libavcodec/aarch64/vc1dsp_init_aarch64.c | 19 +
libavcodec/aarch64/vc1dsp_neon.S | 678 +++++++++++++++++++++++
2 files changed, 697 insertions(+)
Looks generally reasonable. Is it possible to factorize out the individual
transforms (so that you'd e.g. invoke the same macro twice in the 8x8 and 4x4
functions) without too much loss? The downshift which differs between thw two
could either be left outside of the macro, or the downshift amount could be
made a macro parameter.
Another aspect: I forgot the aspect that we have existing arm assembly for
the idct. In some cases, there's value in keeping the implementations
similar if possible and relevant. But your implementation seems quite
straightforward, and seems to get better benchmark numbers on the same
cores, so I guess it's fine to diverge and add a new from-scratch
implementation here.
// Martin
_______________________________________________
ffmpeg-devel mailing list
[email protected]
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
[email protected] with subject "unsubscribe".