On Sat, Nov 18, 2017 at 06:35:48PM +0100, Rafal Dabrowa wrote:
>
> This is a proposal of performance optimizations for 8-bit
> hevc video decoding on aarch64 platform with neon (simd) extension.
>
> I'm testing my optimizations on NanoPi M3 device. I'm using
> mainly "Big Buck Bunny" video file in format 1280x720 for testing.
> The video file was pulled from libde265.org page, see
> http://www.libde265.org/hevc-bitstreams/bbb-1280x720-cfg06.mkv
> The movie duration is 00:10:34.53.
>
> Overall performance gain is about 2x. Without optimizations the movie
> playback stops in practice after a few seconds. With
> optimizations the file is played smoothly 99% of the time.
>
> For performance testing the following command was used:
>
> time ./ffmpeg -hide_banner -i ~/bbb-1280x720-cfg06.mkv -f yuv4mpegpipe - >/dev/null
>
> The video file was pre-read before test to minimize disk reads during testing.
> Program execution time without optimization was as follows:
>
> real 11m48.576s
> user 43m8.111s
> sys 0m12.469s
>
> Execution time with optimizations:
>
> real 6m17.046s
> user 21m19.792s
> sys 0m14.724s
>
Can you post the results of checkasm --bench for hevc?
Did you run it to check for any calling convention violation?
>
> The patch contains optimizations for most heavily used qpel, epel, sao and
> idct functions. Among the functions provided for optimization there are two
> intensively used, but not optimized in this patch: hevc_v_loop_filter_luma_8
> and hevc_h_loop_filter_luma_8. I have no idea how they could be optimized
> hence I left them without optimizations.
>
You may want to check x86/hevc_deblock.asm then (no idea if these are
implemented).
[...]
> +function ff_hevc_put_hevc_pel_pixels4_8_neon, export=1
> + mov x7, 128
> +1: ld1 { v0.s }[0], [x1], x2
> + ushll v4.8h, v0.8b, 6
> + st1 { v4.d }[0], [x0], x7
Is using #128 not possible?
> + subs x3, x3, 1
> + b.ne 1b
> + ret
here and below: no use of the x6 register?
A few comments on the style:
- please use consistent spacing (the current function doesn't match the later
code), preferably a relatively large number of spaces as common
ground (check the other sources)
- we use capitalized size suffixes (B, H, ...); and IIRC the lower-case
forms are problematic with some assemblers, but don't quote me on that.
- we don't put spaces inside {}
> +endfunc
> +
> +function ff_hevc_put_hevc_pel_pixels6_8_neon, export=1
> + mov x7, 120
> +1: ld1 { v0.8b }, [x1], x2
> + ushll v4.8h, v0.8b, 6
> + st1 { v4.d }[0], [x0], 8
I think you need to use # as prefix for the immediates
> + st1 { v4.s }[2], [x0], x7
I assume you can't use #120?
Have you checked if using #128 here and decrementing x0 afterward isn't
faster?
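
For reference, here is a scalar sketch of what I understand the width-6 case to compute, and why the two post-increments add up to one row. It assumes the destination rows are 64 int16_t (128 bytes) apart, which is what the 8 + 120 in the quoted code implies; the function name is made up for illustration.

```c
#include <stdint.h>
#include <stddef.h>

#define MAX_PB_SIZE 64  /* assumed: int16_t elements per destination row (128 bytes) */

/* Scalar model of the width-6 case (hypothetical name): widen each 8-bit
 * pixel to 16 bits and scale it by 1 << 6, as "ushll ..., 6" does. */
static void put_pel_pixels6_model(int16_t *dst, const uint8_t *src,
                                  ptrdiff_t srcstride, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < 6; x++)
            dst[x] = (int16_t)(src[x] << 6);
        src += srcstride;
        /* One row: 8 bytes after the first store + 120 after the second. */
        dst += MAX_PB_SIZE;
    }
}
```

Only the total advance per row matters for the result, so any split of the 128 bytes between the two stores is equivalent.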
[...]
> +function ff_hevc_put_hevc_pel_bi_pixels32_8_neon, export=1
> + mov x10, 128
> +1: ld1 { v0.16b, v1.16b }, [x2], x3 // src
> + ushll v16.8h, v0.8b, 6
> + ushll2 v17.8h, v0.16b, 6
> + ushll v18.8h, v1.8b, 6
> + ushll2 v19.8h, v1.16b, 6
> + ld1 { v20.8h, v21.8h, v22.8h, v23.8h }, [x4], x10 // src2
> + sqadd v16.8h, v16.8h, v20.8h
> + sqadd v17.8h, v17.8h, v21.8h
> + sqadd v18.8h, v18.8h, v22.8h
> + sqadd v19.8h, v19.8h, v23.8h
> + sqrshrun v0.8b, v16.8h, 7
> + sqrshrun2 v0.16b, v17.8h, 7
> + sqrshrun v1.8b, v18.8h, 7
> + sqrshrun2 v1.16b, v19.8h, 7
Does pairing help here?
sqrshrun v0.8b, v16.8h, 7
sqrshrun v1.8b, v18.8h, 7
sqrshrun2 v0.16b, v17.8h, 7
sqrshrun2 v1.16b, v19.8h, 7
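
Any reordering has to keep the per-lane result unchanged, so here is a scalar sketch of the arithmetic as I read it from the quoted instructions (ushll by 6, sqadd, sqrshrun by 7). The helper names are mine, not from the patch.

```c
#include <stdint.h>

/* Saturating signed 16-bit add, as sqadd does per lane. */
static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* Rounding right shift by 7 with unsigned 8-bit saturation,
 * as "sqrshrun ..., 7" does per lane. */
static uint8_t rshift7_sat_u8(int16_t v)
{
    int32_t t = (int32_t)v + 64;      /* rounding constant 1 << (7 - 1) */
    if (t < 0) return 0;              /* negative values saturate to 0 */
    t >>= 7;
    return t > 255 ? 255 : (uint8_t)t;
}

/* One output pixel of the bi-prediction copy (hypothetical name). */
static uint8_t bi_pixel_model(uint8_t src, int16_t src2)
{
    int16_t widened = (int16_t)(src << 6);   /* ushll by 6 */
    return rshift7_sat_u8(sat_add16(widened, src2));
}
```

The non-"2" form of sqrshrun clears the upper half of its destination, so each sqrshrun must still come before the sqrshrun2 that writes the same register; the order suggested above keeps that.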
[...]
> + sqrshrun v0.8b, v16.8h, 7
> + sqrshrun2 v0.16b, v17.8h, 7
> + sqrshrun v1.8b, v18.8h, 7
> + sqrshrun2 v1.16b, v19.8h, 7
> + sqrshrun v2.8b, v20.8h, 7
> + sqrshrun2 v2.16b, v21.8h, 7
> + sqrshrun v3.8b, v22.8h, 7
> + sqrshrun2 v3.16b, v23.8h, 7
Again, this might be a good candidate for attempting to shuffle the
instructions and see if it helps (there are many other places, I picked
one randomly).
> +.Lepel_filters:
const/endconst + align might be better for all these labels
[...]
> +function ff_hevc_put_hevc_epel_hv12_8_neon, export=1
> + add x10, x3, 3
> + lsl x10, x10, 7
> + sub sp, sp, x10 // tmp_array
> + stp x0, x3, [sp, -16]!
> + stp x5, x30, [sp, -16]!
> + add x0, sp, 32
> + sub x1, x1, x2
> + add x3, x3, 3
> + bl ff_hevc_put_hevc_epel_h12_8_neon
> + ldp x5, x30, [sp], 16
> + ldp x0, x3, [sp], 16
> + load_epel_filterh x5, x4
> + mov x5, 112
> + mov x10, 128
> + ld1 { v16.8h, v17.8h }, [sp], x10
> + ld1 { v18.8h, v19.8h }, [sp], x10
> + ld1 { v20.8h, v21.8h }, [sp], x10
> +1: ld1 { v22.8h, v23.8h }, [sp], x10
> + calc_epelh v4, v16, v18, v20, v22
> + calc_epelh2 v4, v5, v16, v18, v20, v22
> + calc_epelh v5, v17, v19, v21, v23
> + st1 { v4.8h }, [x0], 16
> + st1 { v5.4h }, [x0], x5
> + subs x3, x3, 1
> + b.eq 2f
> +
> + ld1 { v16.8h, v17.8h }, [sp], x10
> + calc_epelh v4, v18, v20, v22, v16
> + calc_epelh2 v4, v5, v18, v20, v22, v16
> + calc_epelh v5, v19, v21, v23, v17
> + st1 { v4.8h }, [x0], 16
> + st1 { v5.4h }, [x0], x5
> + subs x3, x3, 1
> + b.eq 2f
> +
> + ld1 { v18.8h, v19.8h }, [sp], x10
> + calc_epelh v4, v20, v22, v16, v18
> + calc_epelh2 v4, v5, v20, v22, v16, v18
> + calc_epelh v5, v21, v23, v17, v19
> + st1 { v4.8h }, [x0], 16
> + st1 { v5.4h }, [x0], x5
> + subs x3, x3, 1
> + b.eq 2f
> +
> + ld1 { v20.8h, v21.8h }, [sp], x10
> + calc_epelh v4, v22, v16, v18, v20
> + calc_epelh2 v4, v5, v22, v16, v18, v20
> + calc_epelh v5, v23, v17, v19, v21
> + st1 { v4.8h }, [x0], 16
> + st1 { v5.4h }, [x0], x5
> + subs x3, x3, 1
> + b.ne 1b
Introducing macros probably makes sense in these functions
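
To illustrate what I mean: the four unrolled blocks differ only in which register pair holds the newest row, and that is the part a macro could take as parameters. Below is a scalar sketch of the rotating four-row window. calc_epelh is not shown in this excerpt, so the 4-tap sum followed by a shift by 6 is my assumption from the C reference, and the names and the 64-element row stride are hypothetical.

```c
#include <stdint.h>

#define MAX_PB_SIZE 64  /* assumed: int16_t elements per intermediate row (128 bytes) */

/* One output sample of a 4-tap vertical filter over four intermediate rows.
 * Assumes an arithmetic right shift for negative sums. */
static int16_t epel_v_tap(const int8_t f[4], int16_t r0, int16_t r1,
                          int16_t r2, int16_t r3)
{
    return (int16_t)((f[0] * r0 + f[1] * r1 + f[2] * r2 + f[3] * r3) >> 6);
}

/* Vertical pass over a temporary array of height + 3 rows. The row slots
 * rotate by index instead of the code being unrolled four times. */
static void epel_v_model(int16_t *dst, const int16_t *tmp, int width,
                         int height, const int8_t f[4])
{
    const int16_t *rows[4];

    for (int i = 0; i < 3; i++)               /* preload three rows */
        rows[i] = tmp + i * MAX_PB_SIZE;

    for (int y = 0; y < height; y++) {
        /* Load one new row into the slot that held the oldest one. */
        rows[(y + 3) & 3] = tmp + (y + 3) * MAX_PB_SIZE;
        for (int x = 0; x < width; x++)
            dst[x] = epel_v_tap(f, rows[y & 3][x], rows[(y + 1) & 3][x],
                                rows[(y + 2) & 3][x], rows[(y + 3) & 3][x]);
        dst += MAX_PB_SIZE;
    }
}
```

In assembly the same idea would be one macro body invoked four times with the register names permuted, instead of four hand-written copies.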
[...]
> +8: b 9f // 0
> + nop
> + nop
> + nop
> + st1 { v29.b }[0], [x7] // 1
> + b 9f
> + nop
> + nop
> + st1 { v29.h }[0], [x7] // 2
> + b 9f
> + nop
> + nop
> + st1 { v29.h }[0], [x7], 2 // 3
> + st1 { v29.b }[2], [x7]
> + b 9f
> + nop
> + st1 { v29.s }[0], [x7] // 4
> + b 9f
> + nop
> + nop
> + st1 { v29.s }[0], [x7], 4 // 5
> + st1 { v29.b }[4], [x7]
> + b 9f
> + nop
> + st1 { v29.s }[0], [x7], 4 // 6
> + st1 { v29.h }[2], [x7]
> + b 9f
> + nop
> + st1 { v29.s }[0], [x7], 4 // 7
> + st1 { v29.h }[2], [x7], 2
> + st1 { v29.b }[6], [x7]
What are these nops for? Alignment?
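
My guess is that each case is padded to a fixed four-instruction slot so that a computed branch can land on it, but the branch itself is not in the part quoted here, so treat that as an assumption. Either way, the stores themselves amount to the following scalar sketch (hypothetical name), which writes the low n bytes with at most one 4-byte, one 2-byte and one 1-byte store.

```c
#include <stdint.h>
#include <string.h>

/* Store the low n bytes (0..7) of an 8-byte vector, mirroring the
 * per-width cases above: 4-byte, then 2-byte, then 1-byte pieces. */
static void store_tail_model(uint8_t *dst, const uint8_t vec[8], unsigned n)
{
    unsigned off = 0;

    if (n & 4) { memcpy(dst + off, vec + off, 4); off += 4; }
    if (n & 2) { memcpy(dst + off, vec + off, 2); off += 2; }
    if (n & 1) { dst[off] = vec[off]; }
}
```

Written this way the width is tested bit by bit rather than jumped to, which might be an alternative to the padded table.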
[...]
Anyway, can you split your patch? It's really a lot of code and there is
no way anyone can review it properly quickly.
I also think macros would be welcome in many places to reduce the size of
the patch(es).
Regards,
--
Clément B.
