On 2016-11-02 13:11:19 +0200, Martin Storsjö wrote:
> On Wed, 2 Nov 2016, Janne Grunau wrote:
>
> >On 2016-10-19 22:18:43 +0300, Martin Storsjö wrote:
>
> >>+ /* We only need h + 7 lines, but the horizontal filter assumes an
> >> \
> >>+ * even number of rows, so filter h + 8 lines here. */
> >> \
> >>+ ff_vp9_put_##filter##sz##_h_neon(temp, 64,
> >> \
> >>+ src - 3 * src_stride, src_stride,
> >> \
> >>+ h + 8, mx, 0);
> >> \
> >>+ ff_vp9_##op##_##filter##sz##_v_neon(dst, dst_stride,
> >> \
> >>+ temp + 3 * 64, 64,
> >> \
> >>+ h, 0, my);
> >> \
> >
> >I guess you haven't tried implementing the diagonal mc fully in asm?
> >Probably not worth the effort since the only significant gain comes from
> >keeping the temporary values in registers which works on size 4 and 8 and
> >maybe 16 for aarch64 wich are probably not that important.
>
> I didn't try that fully, no. I did try an asm frontend just like this one,
> though, to avoid a separate vpush/vpop in each of them, and while it does
> save a little, I had to restructure the end of the filter function, which
> cost a little for the plain-horizontal/vertical functions. All in all I
> guess it would be a gain, but not really worth it I think. Being able to
> keep all the intermediates in registers would probably make it more
> worthwhile, but that would also be even more effort. Since you need 7 more
> rows than the actual input size, and the height can be width*2, I think it
> would be hard to keep the intermediates for anything more than size 4 in
> registers even if you'd do a handmade version for it.
Yes, the variable height complicates it further, I guess 8 and 16 (on
aarch64) works only if you merge the horizontal and vertical filter
completely. I don't think it is worth the effort and patch good as it is
anyway.
> >>+ vrhadd.u8 q3, q3, q11
> >>+ vst1.8 {q0}, [r0, :128]!
> >>+ vst1.8 {q1}, [r0, :128]!
> >>+ vst1.8 {q2}, [r0, :128]!
> >>+ vst1.8 {q3}, [r0, :128], r1
> >>+ subs r12, r12, #1
> >>+ bne 1b
> >>+ bx lr
> >>+endfunc
> >
> >Have you tried 4 d-register loads/stores? If it's not slower it's
> >preferable since it uses less instructions. Same for size 32 with the
> >additional advantange of not having to adjust the stride
>
> Oh, why didn't I remember those instructions? Yes, this gives a huge gain,
> like a 20% speedup in the 32/64 functions. Thanks!
More than expected and I almost didn't ask for it since I assumed you
had tried.
> >>+@ Evaluate the filter twice in parallel, from the inputs src1-src9
> >>into dst1-dst2
> >>+@ (src1-src8 into dst1, src2-src9 into dst2), adding idx2 separately
> >>+@ at the end with saturation
> >>+.macro convolve dst1, dst2, src1, src2, src3, src4, src5, src6, src7,
> >>src8, src9, idx1, idx2, tmp1, tmp2
> >>+ vmul.s16 \dst1, \src1, d0[0]
> >>+ vmul.s16 \dst2, \src2, d0[0]
> >>+ vmla.s16 \dst1, \src2, d0[1]
> >>+ vmla.s16 \dst2, \src3, d0[1]
> >>+ vmla.s16 \dst1, \src3, d0[2]
> >>+ vmla.s16 \dst2, \src4, d0[2]
> >>+.if \idx1 == 3
> >>+ vmla.s16 \dst1, \src4, d0[3]
> >>+ vmla.s16 \dst2, \src5, d0[3]
> >>+.else
> >>+ vmla.s16 \dst1, \src5, d1[0]
> >>+ vmla.s16 \dst2, \src6, d1[0]
> >>+.endif
> >>+ vmla.s16 \dst1, \src6, d1[1]
> >>+ vmla.s16 \dst2, \src7, d1[1]
> >>+ vmla.s16 \dst1, \src7, d1[2]
> >>+ vmla.s16 \dst2, \src8, d1[2]
> >>+ vmla.s16 \dst1, \src8, d1[3]
> >>+ vmla.s16 \dst2, \src9, d1[3]
> >>+.if \idx2 == 3
> >>+ vmul.s16 \tmp1, \src4, d0[3]
> >>+ vmul.s16 \tmp2, \src5, d0[3]
> >>+.else
> >>+ vmul.s16 \tmp1, \src5, d1[0]
> >>+ vmul.s16 \tmp2, \src6, d1[0]
> >>+.endif
> >>+ vqadd.s16 \dst1, \dst1, \tmp1
> >>+ vqadd.s16 \dst2, \dst2, \tmp2
> >>+.endm
> >
> >the negative terms could be accumulated in tmp1/2 I doubt it makes a
> >difference in practice but reduces data dependencies.
>
> This did actually turn out to help quite a lot on A53 (from 11707 cycles to
> 9649, for the 64v function). Very marginal slowdowns on A8 and A9 though,
> but clearly worthwhile doing.
A little surprising. Either vmla is really slow on the cortex-a53 or its
neon unit is contrary to reports in tech press at least partially
128-bit wide.
Janne
_______________________________________________
libav-devel mailing list
[email protected]
https://lists.libav.org/mailman/listinfo/libav-devel