Re: [PATCH v3] RISC-V: Fix code gen for reduction with length 0 [PR118182]

Jeff Law Thu, 26 Dec 2024 16:16:20 -0800



On 12/24/24 2:24 AM, Robin Dapp wrote:

`.MASK_LEN_FOLD_LEFT_PLUS`(or `mask_len_fold_left_plus_m`) is expecting the
return value will be the start value even if the length is 0.

However current code gen in RISC-V backend is not meet that semantic, it will
result a random garbage value if length is 0.

Let example by current code gen for MASK_LEN_FOLD_LEFT_PLUS with f64:
         # _148 = .MASK_LEN_FOLD_LEFT_PLUS (stmp__148.33_134, vect__70.32_138, 
{ -1, ... }, loop_len_161, 0);
         vsetvli zero,a5,e64,m1,ta,ma
         vfmv.s.f        v2,fa5     # insn 1
         vfredosum.vs    v1,v1,v2   # insn 2
         vfmv.f.s        fa5,v1     # insn 3

insn 1:
- vfmv.s.f won't do anything if VL=0, which means v2 will contain garbage value.
insn 2:
- vfredosum.vs won't do anything if VL=0, and keep vd unchanged even TA.
(v-spec say: `If vl=0, no operation is performed and the destination register
  is not updated.`)
insn 3:
- vfmv.f.s will move the value from v1 even VL=0, so this is safe.

So how we fix that? we need two fix for that:

1. insn 1: need always execute with VL=1, so that we can guarantee it will
            always work as expect.
2. insn 2: Add new pattern to force `vd` use same reg as `vs1` (start value) for
            all reduction patterns, then we can guarantee vd[0] will contain the
            start value when vl=0

For 1, it's just a simple change to riscv_vector::expand_reduction, but for 2,
we have to add _AV variant reduction to force `vd` use same reg as `vs1` (start
value).


I think I'm OK with both approaches.  If the second one doesn't negatively
affect register pressure it certainly looks a bit nicer.  On the other hand I
would expect the reduction to dominate latency anyway so an additional vsetvl
shouldn't really matter.

Another thing to consider - can we even hit the vl=0 case with the regular
reductions?  They are VLMAX only so just affected by when we enable SELECT_VL
which we only do with one rgroup so vl=0 shouldn't happen (as opposed to
fold left).  I still need to convince myself that this is correct but the fact
that you only saw fallout with fold left partially confirms it.

No strong opinions either. I had to deal with something in this space awhile back on another design. I don't remember the impedance mismatchbetween the fold-left GCC semantics and the chip, but it was solvablewith a single copy and (IIRC) a mask which was all set up in the expander.

In general I wouldn't expect a fold-left reduction to be a significantperformance factor -- you still have to do the ops in order which is thperformance killer. But supporting fold-left was still useful, even iffor no other reason than allowing other parts of a loop to vectorize,even if the reduction step was a performance bottleneck.


jeff

Re: [PATCH v3] RISC-V: Fix code gen for reduction with length 0 [PR118182]

Reply via email to