On Fri, 3 Sep 2021, Martin Storsjö wrote:
+function \type\()_h264_qpel8_v_lowpass_neon_10
+        ld1     {v16.8H}, [x1], x3
+        ld1     {v18.8H}, [x1], x3
+        ld1     {v20.8H}, [x1], x3
+        ld1     {v22.8H}, [x1], x3
+        ld1     {v24.8H}, [x1], x3
+        ld1     {v26.8H}, [x1], x3
+        ld1     {v28.8H}, [x1], x3
+        ld1     {v30.8H}, [x1], x3
+        ld1     {v17.8H}, [x1], x3
+        ld1     {v19.8H}, [x1], x3
+        ld1     {v21.8H}, [x1], x3
+        ld1     {v23.8H}, [x1], x3
+        ld1     {v25.8H}, [x1]
+
+        transpose_8x8H v16, v18, v20, v22, v24, v26, v28, v30, v0, v1
+        transpose_8x8H v17, v19, v21, v23, v25, v27, v29, v31, v0, v1
+        lowpass_8_10 v16, v17, v18, v19, v16, v17
+        lowpass_8_10 v20, v21, v22, v23, v18, v19
+        lowpass_8_10 v24, v25, v26, v27, v20, v21
+        lowpass_8_10 v28, v29, v30, v31, v22, v23
+        transpose_8x8H v16, v17, v18, v19, v20, v21, v22, v23, v0, v1

I'm a bit surprised by doing this kind of vertical filtering by transposing and doing it horizontally - when vertical filtering can be done so efficiently as-is without needing any extra 'ext' instructions and such. But I see that the existing code does it this way. I'll give it a try to make a PoC of rewriting the existing code for some case to see how it behaves without the transposes.

The potential speedups for the vertical filters are actually huge; I've sent a patch that IMO simplifies this by getting rid of all the transposes. I'd appreciate it if you'd remodel your patch after it.
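
To illustrate why the transposes aren't needed, here is a rough scalar C sketch of the 10 bit vertical lowpass. It is only a reference model, not the code from either patch: the function name qpel8_v_lowpass_10_sketch is made up for this example, and the strides are assumed to be in uint16_t elements rather than bytes. It uses the standard H.264 6-tap filter (1, -5, 20, 20, -5, 1) with (sum + 16) >> 5 rounding and clipping to 0..1023. Each output row depends only on six whole input rows, and every column is handled identically, so the 8 columns map directly onto the 8 lanes of one vector register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference model (not FFmpeg's actual function).
 * src points at the first output row; rows src[-2*stride] up to
 * src[10*stride] are read, i.e. 13 rows, matching the 13 loads above.
 * Strides are in uint16_t elements (an assumption of this sketch). */
static void qpel8_v_lowpass_10_sketch(uint16_t *dst, const uint16_t *src,
                                      ptrdiff_t dst_stride,
                                      ptrdiff_t src_stride)
{
    assert(dst != NULL && src != NULL);

    for (int y = 0; y < 8; y++) {
        const uint16_t *r = src + y * src_stride;

        /* In NEON terms, this inner loop is one operation on a
         * whole row vector; no element shuffling is involved. */
        for (int x = 0; x < 8; x++) {
            int sum = (r[x - 2 * src_stride] + r[x + 3 * src_stride])
                    - 5 * (r[x - src_stride] + r[x + 2 * src_stride])
                    + 20 * (r[x] + r[x + src_stride]);
            int v = sum + 16;

            /* Clamp negatives before shifting, then clip to 10 bits. */
            v = v < 0 ? 0 : v >> 5;
            if (v > 1023)
                v = 1023;
            dst[y * dst_stride + x] = (uint16_t)v;
        }
    }
}
```

With the rows kept in registers as loaded, moving from one output row to the next just means using a different set of six registers, which is where the savings compared to transpose + 'ext' come from.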

// Martin
