On Thu, 2023-12-28 at 14:59 +0800, Li Wei wrote:
> There are currently two versions of the implementations of constant
> vector permutation: loongarch_expand_vec_perm_const_1 and
> loongarch_expand_vec_perm_const_2. The implementations of the two
> versions are different. Currently, only the implementation of
> loongarch_expand_vec_perm_const_1 is used for 256-bit vectors. We
> hope to streamline the code as much as possible while retaining the
> better-performing implementation of the two. By repeatedly testing
> spec2006 and spec2017, we got the following Merged version.
> Compared with the pre-merger version, the number of lines of code
> in loongarch.cc has been reduced by 888 lines. At the same time,
> the performance of SPECint2006 under Ofast has been improved by 0.97%,
> and the performance of SPEC2017 fprate has been improved by 0.27%.

/* snip */

> - * 3. What LASX permutation instruction does:
> - * In short, it just execute two independent 128bit vector permuatation, and
> - * it's the reason that we need to do the jobs below.  We will explain it.
> - * op0, op1, target, and selector will be separate into high 128bit and low
> - * 128bit, and do permutation as the description below:
> - *
> - * a) op0's low 128bit and op1's low 128bit "combines" into a 256bit temp
> - * vector storage (TVS1), elements are indexed as below:
> - *   0 ~ nelt / 2 - 1      nelt / 2 ~ nelt - 1
> - * |---------------------|---------------------| TVS1
> - *   op0's low 128bit      op1's low 128bit
> - * op0's high 128bit and op1's high 128bit are "combined" into TVS2 in the
> - * same way.
> - *   0 ~ nelt / 2 - 1      nelt / 2 ~ nelt - 1
> - * |---------------------|---------------------| TVS2
> - *   op0's high 128bit     op1's high 128bit
> - * b) Selector's low 128bit describes which elements from TVS1 will fit into
> - * target vector's low 128bit.  No TVS2 elements are allowed.
> - * c) Selector's high 128bit describes which elements from TVS2 will fit into
> - * target vector's high 128bit.  No TVS1 elements are allowed.

Just curious: why did the hardware engineers create such a bizarre
instruction? :)

/* snip */

> +  rtx conv_op1 = gen_rtx_SUBREG (E_V4DImode, d->op1, 0);
> +  rtx conv_op0 = gen_rtx_SUBREG (E_V4DImode, d->op0, 0);

Can we prove that d->op0, d->op1, and d->target are never SUBREGs?
Otherwise I'd use lowpart_subreg (E_V4DImode, d->xxx, d->vmode) here to
avoid creating a nested SUBREG (a nested SUBREG causes an ICE, and this
has happened several times before).

/* snip */

> +  switch (d->vmode)
>      {
> -      remapped[i] = d->perm[i];
> +    case E_V4DFmode:
> +      sel = gen_rtx_CONST_VECTOR (E_V4DImode, gen_rtvec_v (d->nelt,
> +                                                           rperm));
> +      tmp = gen_rtx_SUBREG (E_V4DImode, d->target, 0);

Likewise.

> +      emit_move_insn (tmp, sel);
> +      break;
> +    case E_V8SFmode:
> +      sel = gen_rtx_CONST_VECTOR (E_V8SImode, gen_rtvec_v (d->nelt,
> +                                                           rperm));
> +      tmp = gen_rtx_SUBREG (E_V8SImode, d->target, 0);

Likewise.

-- 
Xi Ruoyao <xry...@xry111.site>
School of Aerospace Science and Technology, Xidian University
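
P.S. For anyone following the thread, the per-half behaviour described in
the removed comment (items a to c) can be modelled roughly as below. This
is only a sketch and not GCC code: the function name model_lasx_perm, the
int element type, and the assumption that every selector entry is already
in the range 0 .. nelt - 1 are all made up for illustration, and the real
instruction may treat out-of-range selector bits differently.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model (hypothetical, not part of GCC): permute OP0 and OP1, each
   holding NELT elements, into TARGET as two independent half-vector
   permutations, following the removed comment.  */
void
model_lasx_perm (const int *op0, const int *op1, const unsigned *sel,
                 int *target, size_t nelt)
{
  size_t half = nelt / 2;	/* Elements per 128-bit half.  */

  /* lane 0 is the low 128 bits (TVS1), lane 1 the high 128 bits (TVS2).  */
  for (size_t lane = 0; lane < 2; lane++)
    {
      size_t base = lane * half;

      for (size_t i = 0; i < half; i++)
        {
          unsigned idx = sel[base + i];

          /* Assumption: selector entries are already reduced to the
             TVS index range 0 .. nelt - 1.  */
          assert (idx < nelt);

          /* TVS index 0 .. half - 1 picks from this lane of op0,
             half .. nelt - 1 picks from the same lane of op1.  */
          target[base + i] = (idx < half
                              ? op0[base + idx]
                              : op1[base + idx - half]);
        }
    }
}
```

With nelt = 8, the same selector value in the low and high half picks
different source elements, because each half can only see its own TVS.
That is why a general 256-bit constant permutation needs the extra
remapping work the removed comment goes on to describe.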