On Thu, 2023-12-28 at 14:59 +0800, Li Wei wrote:
> There are currently two versions of the implementations of constant
> vector permutation: loongarch_expand_vec_perm_const_1 and
> loongarch_expand_vec_perm_const_2. The implementations of the two
> versions are different. Currently, only the implementation of
> loongarch_expand_vec_perm_const_1 is used for 256-bit vectors. We
> hope to streamline the code as much as possible while retaining the
> better-performing implementation of the two. By repeatedly testing
> spec2006 and spec2017, we got the following Merged version.
> Compared with the pre-merger version, the number of lines of code
> in loongarch.cc has been reduced by 888 lines. At the same time,
> the performance of SPECint2006 under Ofast has been improved by 0.97%,
> and the performance of SPEC2017 fprate has been improved by 0.27%.
/* snip */
> - * 3. What LASX permutation instruction does:
> - * In short, it just execute two independent 128bit vector permuatation, and
> - * it's the reason that we need to do the jobs below. We will explain it.
> - * op0, op1, target, and selector will be separate into high 128bit and low
> - * 128bit, and do permutation as the description below:
> - *
> - * a) op0's low 128bit and op1's low 128bit "combines" into a 256bit temp
> - * vector storage (TVS1), elements are indexed as below:
> - * 0 ~ nelt / 2 - 1 nelt / 2 ~ nelt - 1
> - * |---------------------|---------------------| TVS1
> - * op0's low 128bit op1's low 128bit
> - * op0's high 128bit and op1's high 128bit are "combined" into TVS2 in the
> - * same way.
> - * 0 ~ nelt / 2 - 1 nelt / 2 ~ nelt - 1
> - * |---------------------|---------------------| TVS2
> - * op0's high 128bit op1's high 128bit
> - * b) Selector's low 128bit describes which elements from TVS1 will fit into
> - * target vector's low 128bit. No TVS2 elements are allowed.
> - * c) Selector's high 128bit describes which elements from TVS2 will fit into
> - * target vector's high 128bit. No TVS1 elements are allowed.
Just curious: why did the hardware engineers create such a bizarre
instruction? :)
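For anyone following along, here is a minimal sketch of the lane-wise
behaviour as the quoted comment describes it. It models only that
description, not the exact operand order or encoding of any real LASX
instruction. The function name lanewise_perm, the fixed NELT of 8, and the
wrap-around of out-of-range selector values are all assumptions made for
illustration.

```c
#define NELT 8 /* assumed: eight 32-bit elements in a 256-bit vector */

/* Hypothetical model of the permutation described in the quoted comment.
   Each 128-bit half (lane) of the target is filled only from the matching
   halves of op0 and op1.  */
static void
lanewise_perm (const int *op0, const int *op1, const unsigned *sel,
               int *target)
{
  const int half = NELT / 2;

  for (int lane = 0; lane < 2; lane++)
    {
      /* Build TVS1 (lane 0) or TVS2 (lane 1): op0's half occupies
         indices 0 .. half - 1, op1's half occupies half .. NELT - 1.  */
      int tvs[NELT];
      for (int i = 0; i < half; i++)
        {
          tvs[i] = op0[lane * half + i];
          tvs[half + i] = op1[lane * half + i];
        }

      /* This lane's selector entries index into this lane's TVS only.
         Assumption: the comment does not say what happens for selector
         values >= NELT, so wrap them to keep the model memory-safe.  */
      for (int i = 0; i < half; i++)
        target[lane * half + i] = tvs[sel[lane * half + i] % NELT];
    }
}
```

With op0 = {0, ..., 7}, op1 = {10, ..., 17} and the selector
{0, 4, 1, 5, 0, 4, 1, 5}, this model yields {0, 10, 1, 11, 4, 14, 5, 15}:
the same selector values pick different elements in each half, and nothing
crosses the 128-bit boundary.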
/* snip */
> + rtx conv_op1 = gen_rtx_SUBREG (E_V4DImode, d->op1, 0);
> + rtx conv_op0 = gen_rtx_SUBREG (E_V4DImode, d->op0, 0);
Can we prove that d->op0, d->op1, and d->target are never SUBREGs? Otherwise
I'd use lowpart_subreg (E_V4DImode, d->xxx, d->vmode) here to avoid
creating a nested SUBREG (a nested SUBREG causes an ICE, and this has
happened several times before).
/* snip */
> + switch (d->vmode)
> {
> - remapped[i] = d->perm[i];
> + case E_V4DFmode:
> + sel = gen_rtx_CONST_VECTOR (E_V4DImode, gen_rtvec_v (d->nelt,
> + rperm));
> + tmp = gen_rtx_SUBREG (E_V4DImode, d->target, 0);
Likewise.
> + emit_move_insn (tmp, sel);
> + break;
> + case E_V8SFmode:
> + sel = gen_rtx_CONST_VECTOR (E_V8SImode, gen_rtvec_v (d->nelt,
> + rperm));
> + tmp = gen_rtx_SUBREG (E_V8SImode, d->target, 0);
Likewise.
--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University