https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119373
--- Comment #6 from Paul-Antoine Arras <parras at gcc dot gnu.org> ---
(In reply to Robin Dapp from comment #5)
> > The analysis of SPEC2017's 510.parest_r shows that the topmost basic
> > block is a tight loop (see attached reducer). Once vectorised, by
> > unrolling and mutualising 4 instructions, AArch64 achieves a 22%
> > reduction in dynamic instruction count (DIC) within the block. However,
> > RISC-V still vectorises but misses the opportunity to further unroll.
> >
> > The vectoriser dump for RISC-V shows the analysis fails for the natural
> > mode RVVM1DF (and chooses RVVMF8QI instead) because it requires a
> > "conversion not supported by target". It turns out this is caused by
> > two missing standard named patterns: vec_unpacku_hi and vec_unpacku_lo.
>
> Why do you consider RVVM1DF a "natural" mode and not RVVMF8QI? As far as
> I can see we do vectorize at full vector size
>
> vsetvli a5,a4,e64,m1,tu,ma

Since the primary underlying scalar mode in the loop is DF, the autodetected
vector mode returned by preferred_simd_mode is RVVM1DF. In comparison,
AArch64 picks VNx2DF, which allows a vectorisation factor of 8. By choosing
RVVMF8QI, RISC-V is restricted to VF = 4.

> Apart from that I don't see too many redundant instructions. We,
> deliberately, don't define unpack_hi and unpack_lo because we don't have
> directly matching instructions and because we prefer widening/narrowing
> with the same number of elements rather than the same vector size.
>
> I suppose much of the icount difference is due to aarch64's complex
> addressing modes. All of the loads here include an offset and a shift
> while we need to do that explicitly. If we had similar addressing modes
> our icount would surely be reduced by >30%.

AArch64 saves 3 instructions per loop iteration thanks to its scaled
addressing mode. But that is not enough to explain why the same basic block
accounts for 79% more dynamic instructions on RISC-V than on AArch64.
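For reference, here is a minimal scalar sketch of the kind of loop under discussion. It is hypothetical (the actual reducer is attached to the PR, and the names below are illustrative): a reduction whose values are double (scalar mode DF) but whose indices are a narrower unsigned type, so vectorising it requires widening the indices, which is where expanders like vec_unpacku_hi/vec_unpacku_lo come into play.

```c
#include <stddef.h>

/* Hypothetical stand-in for the hot loop: an indexed-load reduction.
   The multiply operates on doubles (DF), while colnums[i] is a 32-bit
   unsigned index that must be zero-extended to pointer width before
   the gather-style load from src. */
double indexed_dot(const double *val, const unsigned int *colnums,
                   const double *src, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += val[i] * src[colnums[i]];
    return sum;
}
```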
> Regarding unrolling: We cannot/do not unroll those length-controlled VLA
> loops. If we wanted unrolling we would need a VLS-like loop. Could you
> detail what aarch64 gains by unrolling, i.e. which instructions get
> elided?

I manually set the unrolling factor to 2 in the RISC-V backend and re-ran
the 510.parest_r benchmark. This resulted in an 11.5% decrease in DIC and a
4.5% increase in code size, which I deem a fair tradeoff at -Ofast. The
produced assembly (see attachment) has a long epilogue, but that should not
be an issue for any nontrivial trip count. Do you think this is a desirable
optimisation to pursue?
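To make the shape of the transformation concrete, here is a scalar illustration of what an unroll factor of 2 amounts to (hypothetical, and only an analogy: the actual transformation applies to the vectorised, length-controlled loop in the backend, not to scalar C). The main body processes two elements per trip with separate accumulators, and a remainder loop plays the role of the long epilogue mentioned above.

```c
#include <stddef.h>

/* Scalar analogue of 2x unrolling the indexed reduction:
   two independent accumulators in the main body (helping ILP),
   plus an epilogue loop for any leftover element. */
double indexed_dot_unroll2(const double *val, const unsigned int *colnums,
                           const double *src, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {      /* main body: two elements per trip */
        s0 += val[i] * src[colnums[i]];
        s1 += val[i + 1] * src[colnums[i + 1]];
    }
    for (; i < n; i++)               /* epilogue: remainder element(s) */
        s0 += val[i] * src[colnums[i]];
    return s0 + s1;
}
```

The code-size cost of the epilogue is paid once per loop, while the DIC saving scales with the trip count, which is why the tradeoff looks favourable at -Ofast for nontrivial trip counts.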