That is true in most cases, but for this particular function,
reducing only once turns out to be slower.
The currently submitted version takes roughly:
pix_abs_0_0_rvv_i32: 136.2
The version that reduces only once takes:
pix_abs_0_0_rvv_i32: 169.2
Here is the implementation of the version that reduces only once:
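func ff_pix_abs16_temp_rvv, zve32x
        vsetivli        zero, 16, e32, m4, ta, ma
        vmv.v.i         v24, 0                  // 16 x 32-bit lane accumulators
        vmv.s.x         v0, zero                // seed for the final reduction
1:
        vsetvli         zero, zero, e8, m1, tu, ma
        vle8.v          v4, (a1)                // row of 16 pixels from (a1)
        vle8.v          v12, (a2)               // row of 16 pixels from (a2)
        addi            a4, a4, -1              // one row fewer to go
        vwsubu.vv       v16, v4, v12            // a - b, widened to 16 bits
        add             a1, a1, a3
        vwsubu.vv       v20, v12, v4            // b - a, widened to 16 bits
        vsetvli         zero, zero, e16, m2, tu, ma
        vmax.vv         v16, v16, v20           // signed max gives |a - b|
        add             a2, a2, a3
        vwadd.wv        v24, v24, v16           // accumulate per lane in 32 bits
        bnez            a4, 1b
        vsetvli         zero, zero, e32, m4, ta, ma
        // Single reduction. At e32 the widening form produces a 64-bit sum,
        // which needs ELEN >= 64; vredsum.vs would stay within zve32x.
        vwredsumu.vs    v0, v24, v0
        vmv.x.s         a0, v0
        ret
endfunc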
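
For anyone who wants the two strategies side by side without reading the
assembly, here is a rough scalar C model. It is only a sketch: the names
sad16_reduce_per_row and sad16_reduce_once are made up for illustration and
are not FFmpeg functions. Both return the same sum; the only difference is
where the reduction happens. It says nothing about which one is faster on
real hardware, which is what the benchmark numbers above are for.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical model: collapse each 16-pixel row to a scalar right away,
 * i.e. one reduction per loop iteration. */
static int sad16_reduce_per_row(const uint8_t *a, const uint8_t *b,
                                ptrdiff_t stride, int h)
{
    int sum = 0;

    for (int y = 0; y < h; y++) {
        int row = 0;
        for (int x = 0; x < 16; x++)
            row += abs(a[x] - b[x]);
        sum += row;             /* reduction result folded in every row */
        a += stride;
        b += stride;
    }
    return sum;
}

/* Hypothetical model: keep 16 per-lane accumulators across all rows and
 * reduce them only once after the loop. */
static int sad16_reduce_once(const uint8_t *a, const uint8_t *b,
                             ptrdiff_t stride, int h)
{
    uint32_t acc[16] = { 0 };
    int sum = 0;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++)
            acc[x] += (uint32_t)abs(a[x] - b[x]);   /* lane-wise add only */
        a += stride;
        b += stride;
    }
    for (int x = 0; x < 16; x++)                    /* the single reduction */
        sum += (int)acc[x];
    return sum;
}
```

Calling both with the same two buffers, stride and height should give
identical results for any input.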
On Wed, Feb 7, 2024 at 00:58, Rémi Denis-Courmont <[email protected]> wrote:
> Hi,
>
> To sum a vector, you should only reduce once at the end of the function,
> c.f. how it's done in existing scalar products. Reduction instructions are
> (intrinsically) slow.
>
> --
> Rémi Denis-Courmont
> http://www.remlab.net/
_______________________________________________
ffmpeg-devel mailing list
[email protected]
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
[email protected] with subject "unsubscribe".