https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118057

--- Comment #4 from rdapp.gcc at gmail dot com ---
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118057
>
> --- Comment #3 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
> (In reply to Robin Dapp from comment #2)
>> I think depending on the performance of strided loads/stores this can be
>> profitable to vectorize.  Looks like we need loop versioning to account for
>> the possible aliasing but once this is out of the way we could be OK.
>> 
>> I have a local patch that uses strided stores here (in the limited example)
>> but that's GCC 16 material.
>
> I believe strided/indexed loads/stores are pretty expensive in most of the
> hardware. For example, we have tested 625 X264 reference.
>
> Clang uses indexed load/store to vectorize pixel_satd_8x4 whereas GCC is SLP
> vectorizing with small length unit-stride load/store.
>
> In K1:
> gcc-14 real 24m2629, clang-20 real 30m51.174s.
>
> Big performance drop from gcc-14 to clang-20.
>
> Compile option: -march=rv64gcv_zvl256b -mrvv-vector-bits=zvl,
> -mrvv-max-lmul=m2.

Yes, I agree that costing is not particularly easy here, in particular given
the fragmentation of the microarchitectures and their very different
performance characteristics.

On the other hand, we have a local patch that speeds up x264 SATD
significantly on our uarch with the help of strided loads.

My impression is that we don't want to use strided loads universally and
need to pay attention to make reasonable costing decisions, but there are
cases where they help.
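
To illustrate the loop versioning mentioned in comment #2, here is a minimal
C sketch of what a versioned strided-store loop might look like at the source
level.  This is not the local patch and not what the vectorizer emits; the
names scatter_strided and ranges_disjoint are hypothetical, and the overlap
check is only one possible way to guard the no-alias version.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero when the byte ranges [a, a+an) and
   [b, b+bn) do not overlap.  */
static int ranges_disjoint(const void *a, size_t an, const void *b, size_t bn)
{
    uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b;
    return pa + an <= pb || pb + bn <= pa;
}

/* Store n unit-stride elements of src into dst with an element stride.
   Hypothetical example function, assumed only for illustration.  */
void scatter_strided(int32_t *dst, size_t stride, const int32_t *src, size_t n)
{
    if (n == 0)
        return;

    /* Bytes touched in dst, from the first to the last written element.  */
    size_t dst_span = ((n - 1) * stride + 1) * sizeof *dst;

    if (ranges_disjoint(dst, dst_span, src, n * sizeof *src)) {
        /* No-alias version: iterations are independent, so this loop is a
           candidate for a unit-stride load plus a strided store.  */
        for (size_t i = 0; i < n; i++)
            dst[i * stride] = src[i];
    } else {
        /* Possible aliasing: keep the scalar order of loads and stores.  */
        for (size_t i = 0; i < n; i++)
            dst[i * stride] = src[i];
    }
}
```

Both versions compute the same thing in scalar code; the point is only that
the runtime check separates the case where reordering the accesses is safe
from the one where it is not.  Whether the strided version is actually
profitable is still a costing question for the given uarch.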
