On 2012-06-28 09:20, Jakub Jelinek wrote:
> Perhaps the problem is then that the permutation is much more expensive
> for even/odd.  With even/odd the f2 routine is:
...
>       vpshufb %xmm2, %xmm5, %xmm5
>       vpshufb %xmm1, %xmm4, %xmm4
>       vpor    %xmm4, %xmm5, %xmm4
...
> and with lo/hi it is:
>       vshufps $221, %xmm2, %xmm3, %xmm2

Hmm.  That second has a reformatting delay.

Last week when I pulled the mulv4si3 routine out to i386.c, I
experimented with a few different options, including that
interleave+shufps sequence seen here for lo/hi.  See the comment
there discussing options and timing.

This also shows a deficiency in our vec_perm logic:

        0L 0H 2L 2H     1L 1H 3L 3H
        0H 2H 0H 2H     1H 3H 1H 3H     2*pshufd
        0H 1H 2H 3H                     punpckldq

without the permutation constants in memory.


r~
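
For reference, the 2*pshufd + punpckldq sequence in the diagram above can be sketched with SSE2 intrinsics. This is only an illustration, not code from i386.c: the helper names gather_high_halves and gather_high_halves_u32 are hypothetical, and it assumes an x86 target with SSE2 and that the two inputs hold {0L, 0H, 2L, 2H} and {1L, 1H, 3L, 3H} in element order. Both shuffle selectors are immediates, so nothing is loaded from memory.

```c
#include <emmintrin.h>  /* SSE2: _mm_shuffle_epi32, _mm_unpacklo_epi32 */
#include <stdint.h>

/* Hypothetical helper (not from GCC): given
     a = { 0L, 0H, 2L, 2H }  and  b = { 1L, 1H, 3L, 3H },
   return { 0H, 1H, 2H, 3H } using immediate-only shuffles.  */
static __m128i gather_high_halves(__m128i a, __m128i b)
{
    /* pshufd with selector {1,3,1,3}: keep the odd elements.  */
    __m128i a_hi = _mm_shuffle_epi32(a, _MM_SHUFFLE(3, 1, 3, 1)); /* 0H 2H 0H 2H */
    __m128i b_hi = _mm_shuffle_epi32(b, _MM_SHUFFLE(3, 1, 3, 1)); /* 1H 3H 1H 3H */

    /* punpckldq: interleave the low two elements of each operand.  */
    return _mm_unpacklo_epi32(a_hi, b_hi);                        /* 0H 1H 2H 3H */
}

/* Array wrapper so the result can be inspected without vector types.  */
static void gather_high_halves_u32(const uint32_t a[4], const uint32_t b[4],
                                   uint32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, gather_high_halves(va, vb));
}
```

The pshufd selector here happens to be the same immediate (221) as in the quoted vshufps; the difference is that pshufd and punpckldq are both integer-domain instructions.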