I don't think the loop vectorizer can do more optimization here.
GCC passes vec_perm <,,(nunits - 1, nunits, nunits + 1, ...)> to the
vec_perm_const target hook to handle that. It's very target dependent,
so we can't do more about that.
For RVV, it's better to transform this case into vec_extract + vec_shl_inse
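
To make the equivalence concrete, here is a small scalar model in plain C. It is only a sketch of the semantics, not the code in the patch: both function names are hypothetical, and it assumes the selector is exactly {nunits - 1, nunits, nunits + 1, ...}, indexing the concatenation of the two inputs.

```c
#include <assert.h>
#include <stddef.h>

/* Reference semantics of a two-input permute: sel[i] indexes the
   concatenation {a, b}, so 0..n-1 pick from a and n..2n-1 pick from b.
   (Hypothetical model function, not a GCC API.)  */
static void
model_vec_perm (const int *a, const int *b, const size_t *sel,
                int *out, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      assert (sel[i] < 2 * n);
      out[i] = sel[i] < n ? a[sel[i]] : b[sel[i] - n];
    }
}

/* Same result for sel = {n-1, n, n+1, ...}: extract the last element
   of a, shift b up by one position, and insert the scalar at index 0.
   (Hypothetical model function, not a GCC API.)  */
static void
model_extract_and_slide1up (const int *a, const int *b, int *out, size_t n)
{
  assert (n > 0);
  int last = a[n - 1];          /* extract step */
  for (size_t i = n - 1; i > 0; i--)
    out[i] = b[i - 1];          /* slide b up by one */
  out[0] = last;                /* insert the extracted scalar */
}
```

With n = 4, a = {1, 2, 3, 4}, b = {5, 6, 7, 8} and sel = {3, 4, 5, 6}, both functions fill out with {4, 5, 6, 7}, which is why the generic permute can be replaced by the two cheaper steps for this selector.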
Thanks Robin.
Sent V2:
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/638033.html
with the ChangeLog added, since I noticed a ChangeLog issue in V1:
gcc/ChangeLog:
* config/riscv/riscv-v.cc (shuffle_extract_and_slide1up_patterns):
(expand_vec_perm_const_1):
Tested on zvl128b/