https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119373

--- Comment #6 from Paul-Antoine Arras <parras at gcc dot gnu.org> ---
(In reply to Robin Dapp from comment #5)
> > The analysis of SPEC2017's 510.parest_r shows that the topmost basic block
> > is a tight loop (see attached reducer). Once vectorised, by unrolling and
> > mutualising 4 instructions, AArch64 achieves a 22% reduction in dynamic
> > instruction count (DIC) within the block. However, RISC-V still vectorises
> > but misses the opportunity to further unroll.
> > 
> > The vectoriser dump for RISC-V shows the analysis fails for the natural mode
> > RVVM1DF (and chooses RVVMF8QI instead) because it requires a "conversion not
> > supported by target". It turns out this is caused by two missing standard
> > named patterns: vec_unpacku_hi and vec_unpacku_lo.
> 
> Why do you consider RVVM1DF a "natural" mode and not RVVMF8QI?  As far as I
> can see we do vectorize at full vector size
> 
>   vsetvli a5,a4,e64,m1,tu,ma 

Since the primary underlying scalar mode in the loop is DF, the autodetected
vector mode returned by preferred_simd_mode is RVVM1DF. By comparison, AArch64
picks VNx2DF, which allows a vectorisation factor of 8. By choosing RVVMF8QI,
RISC-V is restricted to VF = 4.
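For reference, the shape of loop at issue is something like the sketch below (hypothetical; the actual reducer is in the attachment): a DF-dominated loop that also widens a QI-mode unsigned element to double, which is where the vec_unpacku_hi/lo named patterns come into play on targets that widen by splitting a vector into high and low halves.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch only -- the real reducer is attached to the PR.
   The primary scalar mode is DF, but each iteration also converts an
   8-bit (QImode) unsigned element to double; that unsigned widening is
   what vec_unpacku_hi/vec_unpacku_lo would cover. */
void scale(double *restrict out, const unsigned char *restrict in,
           double factor, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (double)in[i] * factor; /* unsigned QI -> DF widening */
}
```

On a target that keeps the vector size constant (like AArch64 SVE), the QI vector is unpacked into hi/lo halves repeatedly until the elements are wide enough, so the DF mode can stay the driver of the vectorisation factor.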

> Apart from that I don't see too many redundant instructions.  We,
> deliberately, don't define unpack_hi and unpack_lo because we don't have
> directly matching instructions and because we prefer widening/narrowing with
> the same number of elements rather than the same vector size.
> 
> I suppose much of the icount difference is due to aarch64's complex
> addressing modes.  All of the loads here include an offset and a shift while
> we need to do that explicitly.  If we had similar addressing modes our
> icount would surely be reduced by >30%.

AArch64 saves 3 instructions per loop iteration thanks to its scaled addressing
mode. But that is not enough to explain why the same basic block accounts for
79% more dynamic instructions on RISC-V than on AArch64.
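To illustrate the addressing-mode gap (scalar form; register names and the exact sequence are hypothetical, not taken from the attached assembly):

```
# AArch64: base + scaled register offset folded into the load
ldr   d0, [x1, x2, lsl #3]     # d0 = mem[x1 + x2*8]

# RISC-V: no scaled addressing, so the shift and add are explicit
slli  t0, a2, 3                # t0 = index * 8
add   t0, a1, t0               # t0 = base + index*8
fld   fa0, 0(t0)               # fa0 = mem[base + index*8]
```

Even so, two or three extra address-computation instructions per load do not account for the full 79% gap on their own.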

> Regarding unrolling: We cannot/do not unroll those length-controlled VLA
> loops.  If we wanted unrolling we would need a VLS-like loop.  Could you
> detail what aarch64 gains by unrolling, i.e. which instructions get elided?

I manually set the unrolling factor to 2 in the RISC-V backend and re-ran the
510.parest_r benchmark. This resulted in an 11.5% decrease in DIC and a 4.5%
increase in code size, which I deem a fair tradeoff at -Ofast. The produced
assembly (see attachment) has a long epilogue, but that should not be an issue
for any nontrivial trip count.
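At the source level, the transformation corresponds to the usual unroll-by-2 shape with a scalar epilogue (illustrative sketch only; the actual change was in the backend's unrolling heuristics, and the function below is hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example of a 2x-unrolled loop with an epilogue.  Two
   independent accumulators amortise loop overhead across iterations;
   the epilogue picks up the leftover element when n is odd. */
double dot(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    /* main loop: two iterations' worth of work per pass */
    for (; i + 2 <= n; i += 2) {
        s0 += a[i] * b[i];
        s1 += a[i + 1] * b[i + 1];
    }
    /* epilogue: at most one leftover iteration, so its cost is
       negligible for any nontrivial trip count */
    for (; i < n; i++)
        s0 += a[i] * b[i];
    return s0 + s1;
}
```

The code-size cost comes from duplicating the loop body and emitting the epilogue, which matches the 4.5% growth observed above.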

Do you think this is a desirable optimisation to pursue?
