Hi Paul-Antoine,

This pattern enables the combine pass to merge a vec_duplicate into a plus-mult
or minus-mult RTL instruction.

Before this patch, we have two instructions, e.g.:
  vfmv.v.f        v6,fa0
  vfmadd.vv       v9,v6,v7

After, we get only one:
  vfmadd.vf       v9,fa0,v7

On SPEC2017's 503.bwaves_r, depending on the workload, the reduction in dynamic
instruction count varies from -4.66% to -4.75%.

The general issue with this kind of optimization (we have discussed it a few times already) is that it is uarch-dependent. We want the local combine optimization that you show, but not necessarily the fwprop/late-combine one, where we propagate a vector broadcast into a loop.
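
To make the shape concrete, here is a minimal C sketch of the kind of loop I have in mind (the function name is made up for illustration, it is not from the patch or from bwaves). When this is vectorized, the scalar a either needs a broadcast (vfmv.v.f, normally emitted once outside the loop because it is loop-invariant) or it is used directly as the scalar operand of vfmadd.vf on every iteration. How the vectorizer actually lays this out depends on flags and target, so take it as an illustration only.

```c
#include <stddef.h>

/* y[i] = a * y[i] + x[i]: the multiply-add shape the pattern targets,
   matching vfmadd's vd = rs1 * vd + vs2 operand order.
   `a` is loop-invariant, so its broadcast can live outside the loop
   (vfmadd.vv) or be folded into each multiply-add (vfmadd.vf).  */
void scale_add(double *y, const double *x, double a, size_t n)
{
  for (size_t i = 0; i < n; i++)
    y[i] = a * y[i] + x[i];
}
```

Folding the broadcast only removes an instruction from the loop body if the broadcast was inside the loop to begin with. If it was already hoisted, folding just changes the per-iteration instruction from .vv to .vf.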

So IMHO, in order to continue with this and similar patterns, we need at least accompanying rtx_cost handling that would allow us to tune per uarch.
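
As a rough sketch of the decision I mean, here is a toy cost comparison in C. Everything in it is hypothetical: the struct, its fields, and the functions are NOT existing GCC tuning knobs or hooks, and the numbers one would plug in are made up. It only shows why the same fold can be a win locally and a loss in a loop.

```c
/* Toy per-uarch knobs.  Hypothetical: these are NOT existing GCC fields.  */
struct toy_tune
{
  int vec_fma_cost;            /* one vfmadd.vv */
  int broadcast_cost;          /* one vfmv.v.f */
  int scalar_operand_penalty;  /* extra cost of .vf over .vv, may be 0 */
};

/* Broadcast kept separate: it runs broadcast_execs times, the
   multiply-add runs fma_execs times.  */
int cost_separate(const struct toy_tune *t, int fma_execs, int broadcast_execs)
{
  return broadcast_execs * t->broadcast_cost + fma_execs * t->vec_fma_cost;
}

/* Broadcast folded: every multiply-add pays the scalar-operand penalty.  */
int cost_folded(const struct toy_tune *t, int fma_execs)
{
  return fma_execs * (t->vec_fma_cost + t->scalar_operand_penalty);
}

/* Nonzero if folding is no more expensive than keeping the broadcast.  */
int should_fold(const struct toy_tune *t, int fma_execs, int broadcast_execs)
{
  return cost_folded(t, fma_execs) <= cost_separate(t, fma_execs, broadcast_execs);
}
```

With a nonzero penalty, the local case (one broadcast, one multiply-add) still folds, while a hoisted broadcast feeding many loop iterations does not. With a zero penalty, both fold.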

Pan Li sent a similar patch for vadd.vv/vadd.vx, I think in November, and I believe he intended to continue when stage 1 opens.

An outstanding question is how to distinguish the combine case from the late-combine case. I haven't yet thought about that in detail.

--
Regards
Robin
