On Mon, Oct 06, 2014 at 06:09:07PM +0400, Ilya Tocar wrote:
> > Speaking of -mavx512{bw,vl,f}, there apparently is a full 2-operand shuffle
> > for V32HI, V16S[IF], V8D[IF], so the only one-instruction full
> > 2-operand shuffle we are missing is V64QI, right?
> >
> > What would be the best worst-case sequence for that?
> >
> > I'd think 2x vpermi2w, 2x vpshufb and one vpor could achieve that
> > (first vpermi2w would put the even bytes into the right word positions
> > (i.e. at the right position or one above it), second vpermi2w would put
> > the odd bytes into the right word positions (i.e. at the right position
> > or one below it),
> I think we will also need to spend insns converting the byte-sized mask
> into a word-sized mask.
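To make the quoted scheme concrete (a worked trace; the byte numbers are
illustrative, not from the mail): suppose output byte 5 must come from
input byte 99.  Byte 99 is the high byte of input word 49, so the
odd-byte vpermi2w copies word 49 to word position 5/2 = 2, i.e. output
bytes 4-5, and byte 99 lands exactly at position 5; had the wanted byte
been 98 (the low byte of that word), it would land at position 4, one
below its target.  The even-byte pass behaves symmetrically, landing in
place or one above.  The two vpshufb passes then pick the correct byte
out of each word, and the vpor merges the two halves.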
I'm talking about the constant permutations here (see my other mail to
Kirill).  In that case, you can tweak the mask as much as you want.
I mean something like (completely untested, would need a separate
function):

  if (!TARGET_AVX512BW || d->vmode != V64QImode)
    return false;

  /* We can emit arbitrary two-operand V64QImode permutations
     with 2 vpermi2w, 2 vpshufb and one vpor instruction.  */
  if (d->testing_p)
    return true;

  struct expand_vec_perm_d ds[2];
  rtx rperm[128], vperm, target0, target1;
  unsigned int i, nelt = d->nelt;

  for (i = 0; i < 2; i++)
    {
      ds[i] = *d;
      ds[i].vmode = V32HImode;
      ds[i].nelt = 32;
      ds[i].target = gen_reg_rtx (V32HImode);
      ds[i].op0 = gen_lowpart (V32HImode, d->op0);
      ds[i].op1 = gen_lowpart (V32HImode, d->op1);
    }

  /* Prepare permutations such that the first one takes care of
     putting the even bytes into the right positions or one higher
     positions (ds[0]) and the second one takes care of putting the
     odd bytes into the right positions or one below (ds[1]).  */
  for (i = 0; i < nelt; i++)
    {
      ds[i & 1].perm[i / 2] = d->perm[i] / 2;
      /* The wanted byte ends up as the low or high byte of the word
	 now at position i / 2, i.e. at lane-local byte
	 (i & 14) + (d->perm[i] & 1); the byte belonging to the other
	 pass is zeroed (-1 sets bit 7 of the vpshufb control).  */
      if (i & 1)
	{
	  rperm[i] = constm1_rtx;
	  rperm[i + 64] = GEN_INT ((i & 14) + (d->perm[i] & 1));
	}
      else
	{
	  rperm[i] = GEN_INT ((i & 14) + (d->perm[i] & 1));
	  rperm[i + 64] = constm1_rtx;
	}
    }

  /* Emit the two vpermi2w word gathers.  */
  bool ok = expand_vec_perm_1 (&ds[0]);
  gcc_assert (ok);
  ds[0].target = gen_lowpart (V64QImode, ds[0].target);
  ok = expand_vec_perm_1 (&ds[1]);
  gcc_assert (ok);
  ds[1].target = gen_lowpart (V64QImode, ds[1].target);

  /* Select the even result bytes from ds[0] and the odd ones from
     ds[1] with vpshufb, then merge the two with vpor.  */
  vperm = gen_rtx_CONST_VECTOR (V64QImode, gen_rtvec_v (64, rperm));
  vperm = force_reg (V64QImode, vperm);
  target0 = gen_reg_rtx (V64QImode);
  emit_insn (gen_avx512bw_pshufbv64qi3 (target0, ds[0].target, vperm));
  vperm = gen_rtx_CONST_VECTOR (V64QImode, gen_rtvec_v (64, rperm + 64));
  vperm = force_reg (V64QImode, vperm);
  target1 = gen_reg_rtx (V64QImode);
  emit_insn (gen_avx512bw_pshufbv64qi3 (target1, ds[1].target, vperm));
  emit_insn (gen_iorv64qi3 (d->target, target0, target1));
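Since the sketch is untested, the mask arithmetic can be sanity-checked
with a standalone scalar model of the five-instruction sequence (an
illustrative harness, not part of any patch; vpshufb is modeled with
its per-16-byte-lane, bit-7-zeroes semantics):

  #include <stdio.h>
  #include <stdlib.h>

  /* vpshufb model: bit 7 of the control byte zeroes the result byte,
     bits 0-3 index within the containing 16-byte lane.  */
  static void
  pshufb (unsigned char *dst, const unsigned char *src,
	  const unsigned char *ctl)
  {
    for (int i = 0; i < 64; i++)
      dst[i] = (ctl[i] & 0x80) ? 0 : src[(i & ~15) + (ctl[i] & 15)];
  }

  int
  main (void)
  {
    unsigned char in[128], perm[64], out[64];
    unsigned char t0[64], t1[64], s0[64], s1[64], m0[64], m1[64];

    for (int i = 0; i < 128; i++)
      in[i] = i;
    for (int i = 0; i < 64; i++)
      perm[i] = rand () & 127;

    /* The two vpermi2w passes: word i/2 of t0 (t1) is the input word
       holding the byte wanted at even (odd) position i.  */
    for (int i = 0; i < 64; i++)
      {
	unsigned char *t = (i & 1) ? t1 : t0;
	int w = perm[i] / 2;
	t[(i / 2) * 2] = in[2 * w];
	t[(i / 2) * 2 + 1] = in[2 * w + 1];
      }

    /* The two vpshufb controls, as in the rperm[] computation.  */
    for (int i = 0; i < 64; i++)
      {
	m0[i] = (i & 1) ? 0x80 : (i & 14) + (perm[i] & 1);
	m1[i] = (i & 1) ? (i & 14) + (perm[i] & 1) : 0x80;
      }

    pshufb (s0, t0, m0);
    pshufb (s1, t1, m1);
    for (int i = 0; i < 64; i++)
      out[i] = s0[i] | s1[i];	/* vpor */

    for (int i = 0; i < 64; i++)
      if (out[i] != in[perm[i]])
	{
	  printf ("mismatch at %d\n", i);
	  return 1;
	}
    printf ("ok\n");
    return 0;
  }

	Jakub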