On 06/17/2014 05:33 AM, Evgeny Stupachenko wrote:
> +   1st vec: 0 1 2 3 4 5 6 7
> +   2nd vec: 8 9 10 11 12 13 14 15
> +   3rd vec: 16 17 18 19 20 21 22 23
> +
> +   The output sequence should be:
> +
> +   1st vec: 0 3 6 9 12 15 18 21
> +   2nd vec: 1 4 7 10 13 16 19 22
> +   3rd vec: 2 5 8 11 14 17 20 23
> +
> +   We use 3 shuffle instructions and 3 * 3 - 1 shifts to create such output.

Why not 3 * 2 blend followed by 3 shuffle?  When length is prime, as here,
we know that no blend will ever overlap elements.  So:

1st step

  A1 = blend V1 V2 = 0 9 2 3 12 5 6 15
  A2 = blend V1 V2 = 8 1 10 11 4 13 14 7
  A3 = blend V1 V3 = 16 17 2 19 20 5 22 23

2nd step

  B1 = blend A1 V3 = 0 9 18 3 12 21 6 15
  B2 = blend A2 V3 = 16 1 10 19 4 13 22 7
  B3 = blend A3 V2 = 8 17 2 11 20 5 14 23

3rd step

  C1 = perm B1 = 0 3 6 9 12 15 18 21
  C2 = perm B2 = 1 4 7 10 13 16 19 22
  C3 = perm B3 = 2 5 8 11 14 17 20 23

The final permute here isn't trivial, crossing lanes for avx2 and all,
but the initial permute you use is similar.


r~
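
For anyone who wants to check the lane arithmetic, the blend-then-permute
scheme above can be simulated on plain Python lists.  This is only a sketch:
blend(), perm() and deinterleave3() are hypothetical helper names, not GCC
functions or intrinsics, and it assumes the vector length is not a multiple
of 3, so that each lane has exactly one source vector per output.  The
intermediate blend may differ from A1..A3 in lanes that get overwritten in
the second step; the vectors after the second blend match B1..B3.

```python
# Simulation on plain lists (not real SIMD); helper names are illustrative.

def blend(a, b, take_b):
    """Per lane i, take b[i] if i is in take_b, else a[i]."""
    return [b[i] if i in take_b else a[i] for i in range(len(a))]

def perm(v, sel):
    """Single-vector permute: result[i] = v[sel[i]]."""
    return [v[s] for s in sel]

def deinterleave3(v1, v2, v3):
    """Return (blended, outputs) for a stride-3 load of three vectors."""
    n = len(v1)
    assert n % 3 != 0, "sketch assumes length is not a multiple of 3"
    blended, outputs = [], []
    for k in range(3):
        # Lane i of source vector j holds stream position j*n + i;
        # it belongs to output k when (j*n + i) % 3 == k.
        from_v2 = {i for i in range(n) if (1 * n + i) % 3 == k}
        from_v3 = {i for i in range(n) if (2 * n + i) % 3 == k}
        a = blend(v1, v2, from_v2)   # 1st step
        b = blend(a, v3, from_v3)    # 2nd step
        # 3rd step: stream position 3*m + k sits in lane (3*m + k) % n.
        sel = [(3 * m + k) % n for m in range(n)]
        blended.append(b)
        outputs.append(perm(b, sel))
    return blended, outputs

v1 = list(range(0, 8))
v2 = list(range(8, 16))
v3 = list(range(16, 24))
blended, outputs = deinterleave3(v1, v2, v3)
for row in outputs:
    print(row)
```

With the 8-element inputs from the example, each output needs two blends and
one permute, i.e. 3 * 2 blends and 3 permutes in total.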