On Tue, Jan 29, 2013 at 11:21 AM, Siarhei Siamashka <[email protected]> wrote:
> > +    if (BILINEAR_INTERPOLATION_BITS < 8)
> > +    {
> > +        const __m128i xmm_xorc7 = _mm_set_epi16 (0, BMSK, 0, BMSK, 0, BMSK, 0, BMSK);
> > +        const __m128i xmm_addc7 = _mm_set_epi16 (0, 1, 0, 1, 0, 1, 0, 1);
> > +        const __m128i xmm_x = _mm_set_epi16 (dx, dx, dx, dx, dx, dx, dx, dx);
> > +
> > +        /* calculate horizontal weights */
> > +        xmm_wh = _mm_add_epi16 (xmm_addc7, _mm_xor_si128 (xmm_xorc7, xmm_x));
>
> A minor improvement is possible here, which avoids extra calculations:
>
>     const int32_t wh_pair = (BILINEAR_INTERPOLATION_RANGE - dx) | (dx << 16);
>     xmm_wh = _mm_set_epi32 (wh_pair, wh_pair, wh_pair, wh_pair);

I have to take this back. I expected that reducing the number of SSE2
instructions (which should be the bottleneck) would improve performance,
and that the scalar instructions could run "for free". However, the
benchmarks are showing strange results, and the compiler-generated code
does not look very good either (I can see unjustified spills to the
stack with gcc 4.7).

Also, because the two terms share no set bits, the OR can be written as
an addition:

    wh_pair = (BILINEAR_INTERPOLATION_RANGE - dx) | (dx << 16)
            = (BILINEAR_INTERPOLATION_RANGE - dx) + (dx * 65536)
            = BILINEAR_INTERPOLATION_RANGE + dx * 65535

The last variant needs only two scalar instructions (imul + add), but
the high multiplication latency may cause performance problems if the
instructions are not scheduled well.

Anyway, I'm going to try a complete assembly implementation of bilinear
scaling on Monday, optimized at least for Intel Atom.

-- 
Best regards,
Siarhei Siamashka
_______________________________________________
Pixman mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/pixman
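
The identity above can be checked with a small scalar sketch in plain C. It
models one 32-bit pair of 16-bit lanes from the xor/add code and compares it
with the OR form and the multiply form for every dx in
[0, BILINEAR_INTERPOLATION_RANGE). The macro values are assumptions made for
illustration (7 interpolation bits, BMSK equal to the range minus one), and the
helper names wh_pair_xor_add, wh_pair_or, wh_pair_mul and count_mismatches are
hypothetical, not part of pixman. It uses uint32_t rather than the int32_t of
the quoted snippet to keep the shifts unsigned.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed for illustration; the real values come from pixman's headers. */
#define BILINEAR_INTERPOLATION_BITS  7
#define BILINEAR_INTERPOLATION_RANGE (1 << BILINEAR_INTERPOLATION_BITS)
#define BMSK                         (BILINEAR_INTERPOLATION_RANGE - 1)

/* Scalar model of one pair of 16-bit lanes from the xor/add code:
 * low lane  = (dx ^ BMSK) + 1
 * high lane = (dx ^ 0)    + 0 */
static uint32_t wh_pair_xor_add (uint32_t dx)
{
    uint32_t lo = ((dx ^ BMSK) + 1) & 0xffff;
    uint32_t hi = dx & 0xffff;
    return lo | (hi << 16);
}

/* The OR form from the quoted suggestion. */
static uint32_t wh_pair_or (uint32_t dx)
{
    return (BILINEAR_INTERPOLATION_RANGE - dx) | (dx << 16);
}

/* The multiply form: RANGE + dx * 65535. */
static uint32_t wh_pair_mul (uint32_t dx)
{
    return BILINEAR_INTERPOLATION_RANGE + dx * 65535u;
}

/* Number of dx values in [0, RANGE) for which the three forms disagree. */
static int count_mismatches (void)
{
    int bad = 0;
    for (uint32_t dx = 0; dx < BILINEAR_INTERPOLATION_RANGE; dx++)
    {
        uint32_t ref = wh_pair_xor_add (dx);
        if (wh_pair_or (dx) != ref || wh_pair_mul (dx) != ref)
            bad++;
    }
    return bad;
}
```

Calling count_mismatches () from a test harness should report no disagreement
under these assumptions. This only confirms that the three forms produce the
same value; it says nothing about which one is faster.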
