Hi all,

We can make use of the integrated rotate step of the XAR instruction to
implement most vector integer rotates, as long as we zero out one of its
input registers.  This allows for a lower-latency sequence than the
fallback SHL+USRA, especially when we can hoist the zeroing operation
away from loops and hot parts.  We can also use it for 64-bit vectors as
long as we zero the top half of the vector to be rotated.  That should
still be preferable to the default sequence.

With this patch we can generate for the input:

v4si
G1 (v4si r)
{
    return (r >> 23) | (r << 9);
}

v8qi
G2 (v8qi r)
{
    return (r << 3) | (r >> 5);
}

the assembly for +sve2:

G1:
        movi    v31.4s, 0
        xar     z0.s, z0.s, z31.s, #23
        ret

G2:
        movi    v31.4s, 0
        fmov    d0, d0
        xar     z0.b, z0.b, z31.b, #5
        ret

instead of the current:

G1:
        shl     v31.4s, v0.4s, 9
        usra    v31.4s, v0.4s, 23
        mov     v0.16b, v31.16b
        ret

G2:
        shl     v31.8b, v0.8b, 3
        usra    v31.8b, v0.8b, 5
        mov     v0.8b, v31.8b
        ret

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>

gcc/

	* config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Add
	generation of XAR sequences when possible.

gcc/testsuite/

	* gcc.target/aarch64/rotate_xar_1.c: New test.
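For readers unfamiliar with XAR: it performs a bitwise exclusive OR of
its two source registers followed by a rotate right by immediate on
each element, so XORing with zero leaves only the rotate.  A minimal
scalar model of one 32-bit lane (an illustrative sketch with a
hypothetical helper name, not code from the patch):

#include <stdint.h>

/* Hypothetical scalar model of one XAR lane: (a ^ b) rotated right
   by imm.  The "& 31" keeps the left shift defined when imm == 0.  */
static uint32_t
xar32 (uint32_t a, uint32_t b, unsigned imm)
{
  uint32_t t = a ^ b;
  return (t >> imm) | (t << ((32 - imm) & 31));
}

With b == 0 the XOR is the identity and the result is a plain rotate
right of a by imm.  Since a rotate left by 9 of a 32-bit element equals
a rotate right by 23, G1 above is emitted as xar ..., #23.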
v2-0005-aarch64-Emit-XAR-for-vector-rotates-where-possible.patch