Hi Juzhe,
thanks, this looks pretty comprehensive already.
> +(define_expand "vec_perm<mode>"
> + [(match_operand:V 0 "register_operand")
> + (match_operand:V 1 "register_operand")
> + (match_operand:V 2 "register_operand")
> + (match_operand:<VINDEX> 3 "vector_perm_operand")]
> + "TARGET_VECTOR && GET_MODE_NUNITS (<MODE>mode).is_constant ()"
> + {
> + riscv_vector::expand_vec_perm (operands[0], operands[1],
> + operands[2], operands[3]);
> + DONE;
> + }
> +)
IMHO this would be the perfect use of expand_vec_perm (operands)
instead of passing the individual arguments (which is also the way
aarch64 does it), but your call in the end.
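Something like the following is what I have in mind (just a sketch;
the helper would then take the operands array and index into it
itself):

  riscv_vector::expand_vec_perm (operands);
  DONE;

with

  void
  expand_vec_perm (rtx *operands)
  {
    rtx target = operands[0], op0 = operands[1],
	op1 = operands[2], sel = operands[3];
    ...
  }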
> +/* Return true if VEC is a constant in which every element is in the range
> + [MINVAL, MAXVAL]. The elements do not need to have the same value. */
> +
> +static bool
> +const_vec_all_in_range_p (rtx vec, HOST_WIDE_INT minval, HOST_WIDE_INT maxval)
> +{
> + if (!CONST_VECTOR_P (vec)
> + || GET_MODE_CLASS (GET_MODE (vec)) != MODE_VECTOR_INT)
> + return false;
> +
> + int nunits;
> + if (!CONST_VECTOR_STEPPED_P (vec))
> + nunits = const_vector_encoded_nelts (vec);
> + else if (!CONST_VECTOR_NUNITS (vec).is_constant (&nunits))
> + return false;
> +
> + for (int i = 0; i < nunits; i++)
> + {
> + rtx vec_elem = CONST_VECTOR_ELT (vec, i);
> + if (!CONST_INT_P (vec_elem)
> + || !IN_RANGE (INTVAL (vec_elem), minval, maxval))
> + return false;
> + }
> + return true;
> +}
> +
> +/* Return a const_int vector of VAL. */
> +
> +static rtx
> +gen_const_vector_dup (machine_mode mode, HOST_WIDE_INT val)
> +{
> + rtx c = gen_int_mode (val, GET_MODE_INNER (mode));
> + return gen_const_vec_duplicate (mode, c);
> +}
> +
Both functions are taken from aarch64. I would suggest adding
comments to that effect in case somebody wants to unify them in the
middle end at some point (and wonders whether they really do the same
thing).
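Maybe just something like

  /* Taken from aarch64 - keep in sync or unify in the middle end
     eventually.  */

above each of them.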
> +/* This function emits VLMAX vrgather instruction. Emit vrgather.vx/vi when sel
> +   is a const duplicate vector. Otherwise, emit vrgather.vv. */
> +static void
> +emit_vlmax_gather_insn (rtx target, rtx op, rtx sel)
> +{
> + rtx elt;
> + insn_code icode;
> + machine_mode data_mode = GET_MODE (target);
> + if (const_vec_duplicate_p (sel, &elt))
> + {
> + icode = code_for_pred_gather_scalar (data_mode);
> + sel = elt;
> + }
> + else
> + icode = code_for_pred_gather (data_mode);
> + rtx ops[] = {target, op, sel};
> + emit_vlmax_insn (icode, RVV_BINOP, ops);
> +}
> +
> +static void
> +emit_vlmax_masked_gather_mu_insn (rtx target, rtx op, rtx sel, rtx mask)
> +{
> + rtx elt;
> + insn_code icode;
> + machine_mode data_mode = GET_MODE (target);
> + if (const_vec_duplicate_p (sel, &elt))
> + {
> + icode = code_for_pred_gather_scalar (data_mode);
> + sel = elt;
> + }
> + else
> + icode = code_for_pred_gather (data_mode);
> + rtx ops[] = {target, mask, target, op, sel};
> + emit_vlmax_masked_mu_insn (icode, RVV_BINOP_MU, ops);
> +}
> +
> +/* Implement vec_perm<mode>. */
> +
> +void
> +expand_vec_perm (rtx target, rtx op0, rtx op1, rtx sel)
> +{
> + machine_mode data_mode = GET_MODE (target);
> + machine_mode sel_mode = GET_MODE (sel);
> + /* Enforced by the pattern condition. */
Is this supposed to refer to the const-ness of sel_mode?
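If so, maybe spell it out, e.g.:

  /* The number of units is constant - enforced by the pattern
     condition.  */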
> + /* Note: vec_perm indices are supposed to wrap when they go beyond the
> + size of the two value vectors, i.e. the upper bits of the indices
> + are effectively ignored. RVV vrgather instead produces 0 for any
> + out-of-range indices, so we need to modulo all the vec_perm indices
> + to ensure they are all in range. */
> + int nunits = GET_MODE_NUNITS (sel_mode).to_constant ();
Ah, I saw it in the aarch64 version; it makes more sense when arranged
the way they did it, so I'd suggest using their comment/code order.
Where is the modulo actually happening, though?
> + /* Check if the two values vectors are the same. */
> + if (rtx_equal_p (op0, op1) || const_vec_duplicate_p (sel))
> + {
> + rtx max_sel = gen_const_vector_dup (sel_mode, nunits - 1);
> + rtx sel_mod = expand_simple_binop (sel_mode, AND, sel, max_sel, NULL, 0,
> + OPTAB_DIRECT);
Ah, here. But why not also further down? It gets clearer when reading
the whole function a second time, but a bit of a strategy comment up
front would help clear up misunderstandings.
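Something along these lines maybe (my own wording, just a sketch):

  /* If both input vectors are the same or SEL is a duplicate we can
     mask SEL into [0, nunits - 1] and use a single gather.  Otherwise
     gather from OP0 first (vrgather zeroes all out-of-range elements)
     and then gather the remaining elements from OP1 via SEL - nunits,
     predicated on the SEL >= nunits mask.  */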
Just thinking out loud as I read it:
> + rtx max_sel = gen_const_vector_dup (sel_mode, nunits);
> +
> + /* Step 1: generate a mask that should select second vector. */
> + expand_vec_cmp (mask, GEU, sel, max_sel);
So we select everything >= nunits into the mask, i.e. for nunits == 4
and sel = {0, 5, 2, 7} we'd get mask = {0, 1, 0, 1}.
> + /* Step2: gather a intermediate result for index < nunits,
> + we don't need to care about the result of the element
> + whose index >= nunits. */
> + emit_vlmax_gather_insn (target, op0, sel);
Then we gather the elements of op0 indexed by sel into target.
Elements whose index is >= nunits are zeroed by vrgather.
> +
> + /* Step3: generate a new selector for second vector. */
> + rtx tmp = gen_reg_rtx (sel_mode);
> + rtx ops[] = {tmp, sel, max_sel};
> + emit_vlmax_insn (code_for_pred (MINUS, sel_mode), RVV_BINOP, ops);
Now we shift the range from [nunits, max_of_mode] to
[0, max_of_mode - nunits], e.g. index 6 with nunits == 4 becomes 2
and will select op1[2].
> + /* Step4: merge the result of index >= nunits. */
> + emit_vlmax_masked_gather_mu_insn (target, op1, tmp, mask);
> +}
And now we gather those into the previously masked-out elements of
target. Where do we actually ensure that indices >= 2 * nunits are
taken modulo rather than zeroed? If that happens somewhere, I didn't
spot it, and it certainly warrants a comment.
Are we actually testing the wraparound/modulo indices? I didn't
immediately see it but might have overlooked something. If so, please
add a comment in the test itself about what we expect to happen, or
scan for an "and" or so. Is __builtin_shufflevector undefined for
index values >= 2 * nunits? That would explain it partially.
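For reference, GCC documents __builtin_shuffle's mask elements as
being taken modulo 2 * nunits in the two-operand form, so something
like the following (untested sketch) should exercise the wraparound -
though the middle end might already canonicalize constant masks, so a
variable mask could be needed to actually reach the expander:

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si
  f (v4si a, v4si b)
  {
    /* Mask elements are taken modulo 2 * nunits (= 8 here),
       so 9 selects a[1] and 14 selects b[2].  */
    v4si mask = { 9, 0, 14, 3 };
    return __builtin_shuffle (a, b, mask);
  }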
Regards
Robin