On 2012-06-28 09:20, Jakub Jelinek wrote:
> Perhaps the problem is then that the permutation is much more expensive
> for even/odd.  With even/odd the f2 routine is:
...
>         vpshufb %xmm2, %xmm5, %xmm5
>         vpshufb %xmm1, %xmm4, %xmm4
>         vpor    %xmm4, %xmm5, %xmm4
...
> and with lo/hi it is:
>         vshufps $221, %xmm2, %xmm3, %xmm2

Hmm.  That second sequence has a reformatting delay, since vshufps is
a floating-point shuffle being applied to integer data.

Last week when I pulled the mulv4si3 routine out to i386.c,
I experimented with a few different options, including that
interleave+shufps sequence seen here for lo/hi.  See the 
comment there discussing options and timing.

This also shows a deficiency in our vec_perm logic.  The even/odd
extraction of the high parts could be done as:

        0L 0H 2L 2H     1L 1H 3L 3H
        0H 2H 0H 2H     1H 3H 1H 3H     2*pshufd
        0H 1H 2H 3H                     punpckldq

without the permutation constants in memory.
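
For anyone who wants to check the lane movement in that diagram, here is
a rough scalar model in C.  It is only a sketch: the v4si struct and the
model_pshufd, model_punpckldq, highparts_even_odd and highparts_example
names are made up for illustration and are not anything in i386.c.  The
pshufd selector 0xDD (221) picks lanes 1,3,1,3, which is my reading of
the "0H 2H 0H 2H" row.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 4 x 32-bit vector type, lane 0 first. */
typedef struct { uint32_t e[4]; } v4si;

/* Model of pshufd: result lane i is source lane ((imm >> 2*i) & 3). */
static v4si model_pshufd(v4si src, unsigned imm)
{
    v4si r;
    for (int i = 0; i < 4; i++)
        r.e[i] = src.e[(imm >> (2 * i)) & 3];
    return r;
}

/* Model of punpckldq: interleave the two low lanes of a and b. */
static v4si model_punpckldq(v4si a, v4si b)
{
    v4si r = { { a.e[0], b.e[0], a.e[1], b.e[1] } };
    return r;
}

/* even = 0L 0H 2L 2H, odd = 1L 1H 3L 3H; returns 0H 1H 2H 3H. */
v4si highparts_even_odd(v4si even, v4si odd)
{
    v4si e = model_pshufd(even, 0xDD);   /* 0H 2H 0H 2H */
    v4si o = model_pshufd(odd, 0xDD);    /* 1H 3H 1H 3H */
    return model_punpckldq(e, o);        /* 0H 1H 2H 3H */
}

/* Usage example: 0xAn marks the low half of product n, 0xBn the high. */
void highparts_example(void)
{
    v4si even = { { 0xA0, 0xB0, 0xA2, 0xB2 } };
    v4si odd  = { { 0xA1, 0xB1, 0xA3, 0xB3 } };
    v4si r = highparts_even_odd(even, odd);
    for (uint32_t i = 0; i < 4; i++)
        assert(r.e[i] == 0xB0 + i);
}
```

The selectors are all immediates, so nothing has to be loaded from
memory, unlike the pshufb masks in the even/odd sequence quoted above.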


r~
