Hi Paul-Antoine,
> This pattern enables the combine pass to merge a vec_duplicate into a plus-mult
> or minus-mult RTL instruction.
> Before this patch, we have two instructions, e.g.:
>   vfmv.v.f v6,fa0
>   vfmadd.vv v9,v6,v7
> After, we get only one:
>   vfmadd.vf v9,fa0,v7
> On SPEC2017's 503.bwaves_r, depending on the workload, the reduction in dynamic
> instruction count varies from -4.66% to -4.75%.
The general issue with this kind of optimization (we have discussed it a few
times already) is that, depending on the uarch, we want the local combine
optimization that you show but not the fwprop/late-combine one where we
propagate a vector broadcast into a loop.
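To make the loop case concrete, here is a minimal C sketch. The function
name and signature are made up for illustration, and whether a given compiler
and flag set actually emits these instructions for it is an assumption on my
side, not something I have measured.

```c
#include <stddef.h>

/* Hypothetical kernel, for illustration only:
     acc[i] = a * acc[i] + x[i]
   The scalar 'a' is loop-invariant.  When vectorized for RVV, there are
   (at least) two plausible code shapes:

   1. Broadcast 'a' once before the loop (vfmv.v.f) and use vfmadd.vv in
      the loop body.  This costs one extra vector register for the whole
      loop but no scalar-to-vector transfer per iteration.

   2. Use vfmadd.vf in the loop body.  This saves the broadcast and a
      vector register, but every iteration reads the FP register from the
      vector unit, which may or may not be cheap depending on the uarch.

   Propagating the broadcast into the loop turns shape 1 into shape 2.  */
void scale_accumulate(double *restrict acc, const double *restrict x,
                      double a, size_t n)
{
  for (size_t i = 0; i < n; i++)
    acc[i] = a * acc[i] + x[i];
}
```

When the broadcast and its single use sit next to each other outside of any
loop, shape 2 is presumably never worse; the disagreement between uarchs
should only show up once the broadcast could have stayed outside the loop.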
So IMHO, in order to continue with this and similar patterns, we need at least
accompanying rtx_cost handling that would allow us to tune this per uarch.
Pan Li sent a similar patch for vadd.vv/vadd.vx, I think in November, and I
believe he intended to continue when stage 1 opens.
An outstanding question is how to distinguish the combine case from the
late-combine case. I haven't yet thought about that in detail.
--
Regards
Robin