Hi all,

We can make use of the integrated rotate step of the XAR instruction
to implement most vector integer rotates, as long as we zero out one
of its input registers.  This allows for a lower-latency sequence
than the fallback SHL+USRA, especially when the zeroing operation can be
hoisted out of loops and hot blocks.  We can also use it for 64-bit vectors
as long as we zero the top half of the vector to be rotated.  That should
still be preferable to the default sequence.
With this patch we can generate for the input:
v4si
G1 (v4si r)
{
  return (r >> 23) | (r << 9);
}

v8qi
G2 (v8qi r)
{
  return (r << 3) | (r >> 5);
}
the following assembly when +sve2 is available:
G1:
        movi    v31.4s, 0
        xar     z0.s, z0.s, z31.s, #23
        ret

G2:
        movi    v31.4s, 0
        fmov    d0, d0
        xar     z0.b, z0.b, z31.b, #5
        ret

instead of the current:
G1:
        shl     v31.4s, v0.4s, 9
        usra    v31.4s, v0.4s, 23
        mov     v0.16b, v31.16b
        ret
G2:
        shl     v31.8b, v0.8b, 3
        usra    v31.8b, v0.8b, 5
        mov     v0.8b, v31.8b
        ret

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov <ktkac...@nvidia.com>

gcc/

        * config/aarch64/aarch64.cc (aarch64_emit_opt_vec_rotate): Add
        generation of XAR sequences when possible.

gcc/testsuite/

        * gcc.target/aarch64/rotate_xar_1.c: New test.

Attachment: v2-0005-aarch64-Emit-XAR-for-vector-rotates-where-possible.patch