Kyrylo Tkachov <ktkac...@nvidia.com> writes:
> Hi all,
>
> We can make use of the integrated rotate step of the XAR instruction
> to implement most vector integer rotates, as long as we zero out one
> of the input registers for it. This allows for a lower-latency sequence
> than the fallback SHL+USRA, especially when we can hoist the zeroing operation
> away from loops and other hot parts of the code.
> We can also use it for 64-bit vectors as long
> as we zero the top half of the vector to be rotated. That should still be
> preferable to the default sequence.
Is the zeroing necessary? We don't expect/require that 64-bit vector
modes are maintained in zero-extended form, or that 64-bit ops act as
strict_lowparts, so it should be OK to take a paradoxical subreg.
Or we could just extend the patterns to 64-bit modes, to avoid the
punning.
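
To illustrate the punning I mean (a sketch only; src64/src128 and the
mode pair are made-up names for this example, not taken from
aarch64_emit_opt_vec_rotate):

  /* Reinterpret the 64-bit vector as its 128-bit container through a
     paradoxical subreg rather than zero-extending it with an fmov.
     The upper 64 bits are undefined, but the caller only reads the
     low 64 bits of the rotate result.  */
  rtx src128 = lowpart_subreg (V16QImode, src64, V8QImode);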
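
For anyone following along: the trick works because XAR computes
ROR (a ^ b, #imm), so a zeroed second operand turns the EOR into a
no-op and leaves a plain rotate.  A scalar C model of one element
(illustrative only; xar_model is a made-up name):

  #include <stdint.h>

  /* Model XAR on a single 32-bit element: EOR the two inputs, then
     rotate right by imm (1 <= imm <= 31 here so both shifts are well
     defined).  With b == 0 this is simply a rotate of a.  */
  static inline uint32_t
  xar_model (uint32_t a, uint32_t b, unsigned imm)
  {
    uint32_t x = a ^ b;
    return (x >> imm) | (x << (32 - imm));
  }

  /* xar_model (r, 0, 23) == (r >> 23) | (r << 9), matching G1 below.  */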
> With this patch we can generate, for the input:
> v4si
> G1 (v4si r)
> {
> return (r >> 23) | (r << 9);
> }
>
> v8qi
> G2 (v8qi r)
> {
> return (r << 3) | (r >> 5);
> }
> the following assembly for +sve2:
> G1:
> movi v31.4s, 0
> xar z0.s, z0.s, z31.s, #23
> ret
>
> G2:
> movi v31.4s, 0
> fmov d0, d0
> xar z0.b, z0.b, z31.b, #5
> ret
>
> instead of the current:
> G1:
> shl v31.4s, v0.4s, 9
> usra v31.4s, v0.4s, 23
> mov v0.16b, v31.16b
> ret
> G2:
> shl v31.8b, v0.8b, 3
> usra v31.8b, v0.8b, 5
> mov v0.8b, v31.8b
> ret
>
> Bootstrapped and tested on aarch64-none-linux-gnu.
>
> Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>
>
> gcc/
>
> * config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Add
> generation of XAR sequences when possible.
>
> gcc/testsuite/
>
> * gcc.target/aarch64/rotate_xar_1.c: New test.
> [...]
> +/*
> +** G1:
> +** movi? [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +** xar v0\.2d, v([0-9]+)\.2d, v([0-9]+)\.2d, 39
FWIW, the (...) captures aren't necessary, since we never use backslash
references to them later.
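E.g. something like this for G1 (untested, just the same patterns with
the capturing parentheses dropped) should be enough:

    **	movi?	[vdz][0-9]+\.?(?:[0-9]*[bhsd])?, #?0
    **	xar	v0\.2d, v[0-9]+\.2d, v[0-9]+\.2d, 39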
Thanks,
Richard
> +** ret
> +*/
> +v2di
> +G1 (v2di r) {
> + return (r >> 39) | (r << 25);
> +}
> +
> +/*
> +** G2:
> +** movi? [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +** xar z0\.s, z([0-9]+)\.s, z([0-9]+)\.s, #23
> +** ret
> +*/
> +v4si
> +G2 (v4si r) {
> + return (r >> 23) | (r << 9);
> +}
> +
> +/*
> +** G3:
> +** movi? [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +** xar z0\.h, z([0-9]+)\.h, z([0-9]+)\.h, #5
> +** ret
> +*/
> +v8hi
> +G3 (v8hi r) {
> + return (r >> 5) | (r << 11);
> +}
> +
> +/*
> +** G4:
> +** movi? [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +** xar z0\.b, z([0-9]+)\.b, z([0-9]+)\.b, #6
> +** ret
> +*/
> +v16qi
> +G4 (v16qi r)
> +{
> + return (r << 2) | (r >> 6);
> +}
> +
> +/*
> +** G5:
> +** movi? [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +** fmov d[0-9]+, d[0-9]+
> +** xar z0\.s, z([0-9]+)\.s, z([0-9]+)\.s, #22
> +** ret
> +*/
> +v2si
> +G5 (v2si r) {
> + return (r >> 22) | (r << 10);
> +}
> +
> +/*
> +** G6:
> +** movi? [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +** fmov d[0-9]+, d[0-9]+
> +** xar z0\.h, z([0-9]+)\.h, z([0-9]+)\.h, #7
> +** ret
> +*/
> +v4hi
> +G6 (v4hi r) {
> + return (r >> 7) | (r << 9);
> +}
> +
> +/*
> +** G7:
> +** movi? [vdz]([0-9]+)\.?(?:[0-9]*[bhsd])?, #?0
> +** fmov d[0-9]+, d[0-9]+
> +** xar z0\.b, z([0-9]+)\.b, z([0-9]+)\.b, #5
> +** ret
> +*/
> +v8qi
> +G7 (v8qi r)
> +{
> + return (r << 3) | (r >> 5);
> +}
> +