On Thu, 2023-12-28 at 14:59 +0800, Li Wei wrote:
> There are currently two versions of the implementations of constant
> vector permutation: loongarch_expand_vec_perm_const_1 and
> loongarch_expand_vec_perm_const_2.  The implementations of the two
> versions are different. Currently, only the implementation of
> loongarch_expand_vec_perm_const_1 is used for 256-bit vectors.  We
> hope to streamline the code as much as possible while retaining the
> better-performing implementation of the two.  By repeatedly testing
> spec2006 and spec2017, we got the following Merged version.
> Compared with the pre-merger version, the number of lines of code
> in loongarch.cc has been reduced by 888 lines.  At the same time,
> the performance of SPECint2006 under Ofast has been improved by 0.97%,
> and the performance of SPEC2017 fprate has been improved by 0.27%.

/* snip */

> - * 3. What LASX permutation instruction does:
> - * In short, it just execute two independent 128bit vector permuatation, and
> - * it's the reason that we need to do the jobs below.  We will explain it.
> - * op0, op1, target, and selector will be separate into high 128bit and low
> - * 128bit, and do permutation as the description below:
> - *
> - *  a) op0's low 128bit and op1's low 128bit "combines" into a 256bit temp
> - * vector storage (TVS1), elements are indexed as below:
> - *       0 ~ nelt / 2 - 1      nelt / 2 ~ nelt - 1
> - *   |---------------------|---------------------| TVS1
> - *       op0's low 128bit      op1's low 128bit
> - *    op0's high 128bit and op1's high 128bit are "combined" into TVS2 in the
> - *    same way.
> - *       0 ~ nelt / 2 - 1      nelt / 2 ~ nelt - 1
> - *   |---------------------|---------------------| TVS2
> - *       op0's high 128bit   op1's high 128bit
> - *  b) Selector's low 128bit describes which elements from TVS1 will fit into
> - *  target vector's low 128bit.  No TVS2 elements are allowed.
> - *  c) Selector's high 128bit describes which elements from TVS2 will fit into
> - *  target vector's high 128bit.  No TVS1 elements are allowed.

Just curious: why did the hardware engineers create such a bizarre
instruction? :)
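
For anyone following along, here is how I read the removed comment,
written as a small scalar C model.  This is only a sketch of the
behaviour the comment describes, not the actual instruction encoding:
the function name lasx_perm_model and the fixed NELT of 8 (as for
V8SImode) are made up for illustration, and the selector values are
assumed to already be in the range 0 ~ nelt - 1.

```c
#include <assert.h>

#define NELT 8 /* Hypothetical: eight elements, as in a V8SI vector.  */

/* Model of the per-128-bit-lane permutation described in the comment.
   Lane 0 is the low half, lane 1 is the high half.  */
static void
lasx_perm_model (const int *op0, const int *op1, const int *sel,
		 int *target)
{
  const int half = NELT / 2;

  for (int lane = 0; lane < 2; lane++)
    {
      int tvs[NELT];

      /* Build TVS1 (lane 0) or TVS2 (lane 1): the op0 half occupies
	 indices 0 ~ nelt / 2 - 1, the op1 half nelt / 2 ~ nelt - 1.  */
      for (int i = 0; i < half; i++)
	{
	  tvs[i] = op0[lane * half + i];
	  tvs[half + i] = op1[lane * half + i];
	}

      /* Each half of the selector may only pick from its own TVS.  */
      for (int i = 0; i < half; i++)
	{
	  int s = sel[lane * half + i];
	  assert (s >= 0 && s < NELT); /* Assumption: in-range selector.  */
	  target[lane * half + i] = tvs[s];
	}
    }
}
```

With op0 = {0, ..., 7}, op1 = {10, ..., 17} and a selector of
{0, 4, 1, 5, 0, 4, 1, 5}, this model produces
{0, 10, 1, 11, 4, 14, 5, 15}: the same selector values pick different
source elements in each half, and no selector value can move op0[4]
into the low half of the target.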

/* snip */

> +       rtx conv_op1 = gen_rtx_SUBREG (E_V4DImode, d->op1, 0);
> +       rtx conv_op0 = gen_rtx_SUBREG (E_V4DImode, d->op0, 0);

Can we prove d->op0, d->op1, and d->target are never SUBREGs?  Otherwise
I'd use lowpart_subreg (E_V4DImode, d->xxx, d->vmode) here to avoid
creating a nested SUBREG (nested SUBREG will cause an ICE and it has
happened several times before).

/* snip */

> +       switch (d->vmode)
>           {
> -           remapped[i] = d->perm[i];
> +         case E_V4DFmode:
> +           sel = gen_rtx_CONST_VECTOR (E_V4DImode, gen_rtvec_v (d->nelt,
> +                                                               rperm));
> +           tmp = gen_rtx_SUBREG (E_V4DImode, d->target, 0);

Likewise.

> +           emit_move_insn (tmp, sel);
> +           break;
> +         case E_V8SFmode:
> +           sel = gen_rtx_CONST_VECTOR (E_V8SImode, gen_rtvec_v (d->nelt,
> +                                                               rperm));
> +           tmp = gen_rtx_SUBREG (E_V8SImode, d->target, 0);

Likewise.

-- 
Xi Ruoyao <xry...@xry111.site>
School of Aerospace Science and Technology, Xidian University
