Hi all,
We can make use of the integrated rotate step of the XAR instruction
to implement most vector integer rotates, as long as we zero out one
of the input registers for it. This allows for a lower-latency sequence
than the fallback SHL+USRA, especially when we can hoist the zeroing operation
awa
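
For illustration only, here is a minimal scalar sketch in portable C of the
equivalence described above. It models a single 64-bit lane and is not the
patch itself. The helper names (ror64, xar_lane, rotr_via_xar,
rotr_via_shl_usra, check_rotate_models) are hypothetical and chosen for this
example. XAR is modelled as "XOR the two inputs, then rotate right by an
immediate", so with one input zeroed it reduces to a plain rotate.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 64-bit value right by n bits (n is reduced modulo 64). */
static uint64_t ror64(uint64_t x, unsigned n)
{
    n &= 63;
    return n ? (x >> n) | (x << (64 - n)) : x;
}

/* Scalar model of one 64-bit XAR lane: XOR the inputs, then rotate right. */
static uint64_t xar_lane(uint64_t a, uint64_t b, unsigned imm)
{
    return ror64(a ^ b, imm);
}

/* Rotate via XAR: x ^ 0 == x, so only the rotate step has any effect. */
static uint64_t rotr_via_xar(uint64_t x, unsigned n)
{
    const uint64_t zero = 0; /* the zeroed input register */
    return xar_lane(x, zero, n);
}

/* Model of the fallback: SHL, then USRA (shift right and accumulate). */
static uint64_t rotr_via_shl_usra(uint64_t x, unsigned n)
{
    n &= 63;
    if (n == 0)
        return x;
    uint64_t acc = x << (64 - n); /* SHL: high part of the result */
    acc += x >> n;                /* USRA: add in the low part */
    return acc;
}

/* Check that both models agree for every rotate amount. */
static void check_rotate_models(uint64_t x)
{
    for (unsigned n = 0; n < 64; n++)
        assert(rotr_via_xar(x, n) == rotr_via_shl_usra(x, n));
}
```

In the fallback model the addition behaves like an OR because the two shifted
parts occupy disjoint bits. The zero constant does not depend on the value
being rotated, which is why it is a candidate for hoisting.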
Kyrylo Tkachov writes:
> Hi all,
>
> We can make use of the integrated rotate step of the XAR instruction
> to implement most vector integer rotates, as long as we zero out one
> of the input registers for it. This allows for a lower-latency sequence
> than the fallback SHL+USRA, especially when we