https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118057
--- Comment #3 from JuzheZhong <juzhe.zhong at rivai dot ai> ---
(In reply to Robin Dapp from comment #2)
> I think depending on the performance of strided loads/stores this can be
> profitable to vectorize. Looks like we need loop versioning to account for
> the possible aliasing but once this is out of the way we could be OK.
>
> I have a local patch that uses strided stores here (in the limited example)
> but that's GCC 16 material.

I believe strided/indexed loads/stores are pretty expensive on most hardware.

For example, we have tested the 625 x264 reference run. Clang uses indexed
loads/stores to vectorize pixel_satd_8x4, whereas GCC SLP-vectorizes it with
short unit-stride loads/stores.

On K1:
gcc-14:   real 24m2629
clang-20: real 30m51.174s

That is a big performance drop going from gcc-14 to clang-20.

Compile options:
-march=rv64gcv_zvl256b -mrvv-vector-bits=zvl -mrvv-max-lmul=m2
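
For readers following the loop-versioning point in the quoted comment, here is a minimal hand-written sketch in C of what a versioned strided-store loop could look like. It is not the test case from this PR and not what GCC emits; `strided_copy` and `ranges_disjoint` are hypothetical names used only for illustration. The idea is a runtime overlap check that picks between a no-alias loop (a candidate for strided vector stores) and a scalar fallback.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if [a, a+an) and [b, b+bn) do not overlap. */
static int ranges_disjoint(const void *a, size_t an, const void *b, size_t bn)
{
    uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b;
    return pa + an <= pb || pb + bn <= pa;
}

/* Hypothetical example: dst[i * stride] = src[i] for i in [0, n). */
void strided_copy(int32_t *dst, const int32_t *src, size_t n, size_t stride)
{
    if (n == 0)
        return;

    /* Byte extents touched by the stores and by the loads. */
    size_t dst_bytes = ((n - 1) * stride + 1) * sizeof *dst;
    size_t src_bytes = n * sizeof *src;

    if (ranges_disjoint(dst, dst_bytes, src, src_bytes)) {
        /* No-alias version: the compiler is free to use unit-stride
           loads and strided stores here. */
        int32_t *restrict d = dst;
        const int32_t *restrict s = src;
        for (size_t i = 0; i < n; i++)
            d[i * stride] = s[i];
    } else {
        /* Possible aliasing: keep strict scalar ordering. */
        for (size_t i = 0; i < n; i++)
            dst[i * stride] = src[i];
    }
}
```

Whether the no-alias version is actually faster depends on how the target implements strided stores, which is the concern raised in this comment.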