I have been working on tuning vector transpose within groups of vregs.
The canonical approach is to make multiple passes across pairs of rows,
zipping row pairs first at the API element width, then at double SEW,
doubling SEW again at each subsequent pass until the width reaches VLEN/2
on the final pass.
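For concreteness, here is the pass structure for a tiny 4x4 uint32_t case,
written with GCC generic vectors and __builtin_shufflevector (purely
illustrative; the type and function name are made up, and this is source
code, not what the vectorizer generates):

#include <stdint.h>

typedef uint32_t v4u32 __attribute__ ((vector_size (16)));

void
transpose4x4 (v4u32 r[4])
{
  /* Pass 1: zip pairs of rows at the element width (SEW).  */
  v4u32 t0 = __builtin_shufflevector (r[0], r[1], 0, 4, 1, 5);
  v4u32 t1 = __builtin_shufflevector (r[0], r[1], 2, 6, 3, 7);
  v4u32 t2 = __builtin_shufflevector (r[2], r[3], 0, 4, 1, 5);
  v4u32 t3 = __builtin_shufflevector (r[2], r[3], 2, 6, 3, 7);

  /* Pass 2: zip at 2*SEW; each adjacent pair of lanes moves as a unit.  */
  r[0] = __builtin_shufflevector (t0, t2, 0, 1, 4, 5);
  r[1] = __builtin_shufflevector (t0, t2, 2, 3, 6, 7);
  r[2] = __builtin_shufflevector (t1, t3, 0, 1, 4, 5);
  r[3] = __builtin_shufflevector (t1, t3, 2, 3, 6, 7);
}

Wider rows just add more passes, doubling the zip granularity each time;
pass 2 above is already the same operation viewed as a SEW=64 zip.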
That's all good for the first few passes, as long as SEW <= ELEN. However,
once SEW > ELEN, the middle-end emits BIT_FIELD_REF, and that results in
stack spills of vregs followed by scalar loads. LLVM does much better and
emits vslide{up,down}.
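To make that concrete: the final pass, where the zipped width is VLEN/2,
needs nothing but slides. Here is a sketch in RVV intrinsics (the function
name and calling shape are mine; it is only meant to show the kind of
slide-based sequence that avoids scalar code entirely):

#include <riscv_vector.h>

/* Zip the low halves and the high halves of two SEW=64 vregs, i.e. a
   zip at half-register granularity, using nothing but slides -- no
   scalar registers, no stack.  */
void
zip_halves (vuint64m1_t x, vuint64m1_t y, vuint64m1_t *lo, vuint64m1_t *hi)
{
  size_t vl = __riscv_vsetvlmax_e64m1 ();
  size_t half = vl / 2;

  /* lo = x[0..half) | y[0..half): x supplies the low lanes of the dest,
     vslideup writes y's low lanes above them.  */
  *lo = __riscv_vslideup_vx_u64m1 (x, y, half, vl);

  /* hi = x[half..vl) | y[half..vl): slide both high halves down, then
     slide y's back up over x's.  */
  vuint64m1_t xh = __riscv_vslidedown_vx_u64m1 (x, half, vl);
  vuint64m1_t yh = __riscv_vslidedown_vx_u64m1 (y, half, vl);
  *hi = __riscv_vslideup_vx_u64m1 (xh, yh, half, vl);
}

The intermediate widths (SEW > ELEN but below VLEN/2) would need masked
slides or a vrgather instead, but nothing in such a sequence ever has to
touch a GPR or the stack.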
There are two levels of dysfunction here:
1. Why spill & fill through the stack? Why not extract scalars directly
from vregs into scalar regs? (A sketch of that follows this list.)
2. Why involve scalar registers at all? Why not vslide or even vrgather,
using temporary vregs as necessary?
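On point 1: even with ELEN=64, a 128-bit element can be pulled out of a
vreg with vslidedown plus vmv.x.s and no stack traffic at all. A minimal
intrinsics sketch (names are mine; it assumes RV64 and the usual
little-endian lane order, so each half lands in a GPR):

#include <riscv_vector.h>
#include <stdint.h>

/* Extract 128-bit element I -- i.e. 64-bit lanes 2*I and 2*I+1 -- from a
   vreg holding SEW=64 data, straight into scalar registers.  */
void
extract_ti (vuint64m1_t v, size_t i, uint64_t *lo, uint64_t *hi)
{
  size_t vl = __riscv_vsetvlmax_e64m1 ();
  vuint64m1_t t = __riscv_vslidedown_vx_u64m1 (v, 2 * i, vl);
  *lo = __riscv_vmv_x_s_u64m1_u64 (t);    /* lane 2*i   */
  t = __riscv_vslidedown_vx_u64m1 (t, 1, vl);
  *hi = __riscv_vmv_x_s_u64m1_u64 (t);    /* lane 2*i+1 */
}

That is a handful of vector instructions and one temporary vreg, with no
spill at all.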
The fatal deficiency seems to be that the backend lacks vec_extractNM
patterns for element modes M wider than ELEN. Here are some ideas:
1. Define scalar modes M larger than DImode. AArch64 defines TI, OI, and
XI modes for 128-, 256-, and 512-bit integers (all of which are wider
than the hardware supports).
2. Define vector modes M that are half, quarter, eighth, ... the width of
vector mode N. That can be done with mode iterators. We already have
VLS_HALF and VLS_QUARTER, but there are no such iterators for the VLA
modes. Note: there are no fractional LMUL modes defined for SEW=64, i.e.,
no RVVMF[248]DI.
Comments? Better ideas?
G