Re: [Pixman] [PATCH 3/4] sse2: affine bilinear fetcher

Siarhei Siamashka Fri, 01 Feb 2013 04:23:20 -0800

On Tue, Jan 29, 2013 at 11:21 AM, Siarhei Siamashka
<[email protected]> wrote:
> > +    if (BILINEAR_INTERPOLATION_BITS < 8)
> > +    {
> > +     const __m128i xmm_xorc7 = _mm_set_epi16 (0, BMSK, 0, BMSK, 0, BMSK, 0, BMSK);
> > +     const __m128i xmm_addc7 = _mm_set_epi16 (0, 1, 0, 1, 0, 1, 0, 1);
> > +     const __m128i xmm_x = _mm_set_epi16 (dx, dx, dx, dx, dx, dx, dx, dx);
> > +
> > +     /* calculate horizontal weights */
> > +     xmm_wh = _mm_add_epi16 (xmm_addc7, _mm_xor_si128 (xmm_xorc7, xmm_x));
>
> A minor improvement is possible here, which avoids extra calculations:
>
>     const int32_t wh_pair = (BILINEAR_INTERPOLATION_RANGE - dx) | (dx << 16);
>     xmm_wh = _mm_set_epi32 (wh_pair, wh_pair, wh_pair, wh_pair);


I have to take this back. I expected that reducing the number of SSE2
instructions (which should be the bottleneck) would improve
performance, and that the scalar instructions could run "for free". But
the benchmarks are showing strange results, and the compiler-generated
code does not look very good either (I can see unjustified spills to
the stack with gcc 4.7).

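Also, because (BILINEAR_INTERPOLATION_RANGE - dx) fits in the low 16
bits, the OR can be replaced by an addition:

wh_pair = (BILINEAR_INTERPOLATION_RANGE - dx) | (dx << 16) =
(BILINEAR_INTERPOLATION_RANGE - dx) + (dx * 65536) =
BILINEAR_INTERPOLATION_RANGE + dx * 65535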

The latter variant needs only two scalar instructions (imul + add),
but high multiplication latency may cause performance problems if the
instructions are not scheduled right.
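
For anyone who wants to double-check the arithmetic, here is a minimal
scalar sketch that compares the three ways of building one 32-bit weight
pair. It assumes BILINEAR_INTERPOLATION_BITS is 7, that BMSK is
BILINEAR_INTERPOLATION_RANGE - 1, and that dx is already reduced to the
range 0..BMSK. The helper names are hypothetical and not part of the
patch.

```c
#include <stdint.h>

/* Assumed values for illustration only. */
#define BILINEAR_INTERPOLATION_BITS  7
#define BILINEAR_INTERPOLATION_RANGE (1 << BILINEAR_INTERPOLATION_BITS)
#define BMSK                         (BILINEAR_INTERPOLATION_RANGE - 1)

/* One 32-bit pair of 16-bit lanes as the xor/add SSE2 code computes it:
 * low lane  = (dx ^ BMSK) + 1, high lane = (dx ^ 0) + 0. */
static uint32_t wh_pair_xor_add (uint32_t dx)
{
    uint32_t lo = ((dx ^ BMSK) + 1) & 0xffff;
    uint32_t hi = dx & 0xffff;
    return lo | (hi << 16);
}

/* The scalar variant from the earlier reply. */
static uint32_t wh_pair_or_shift (uint32_t dx)
{
    return (BILINEAR_INTERPOLATION_RANGE - dx) | (dx << 16);
}

/* The multiply + add variant derived above. */
static uint32_t wh_pair_mul_add (uint32_t dx)
{
    return BILINEAR_INTERPOLATION_RANGE + dx * 65535u;
}

/* Returns the number of dx values for which the variants disagree. */
static int count_mismatches (void)
{
    int bad = 0;
    uint32_t dx;

    for (dx = 0; dx <= BMSK; dx++)
    {
        uint32_t ref = wh_pair_xor_add (dx);

        if (wh_pair_or_shift (dx) != ref || wh_pair_mul_add (dx) != ref)
            bad++;
    }
    return bad;
}
```

Under these assumptions count_mismatches () should return 0. It says
nothing about which variant is faster, which is the open question here.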

Anyway, I'm going to try a complete assembly implementation of
bilinear scaling on Monday, optimized at least for Intel Atom.

--
Best regards,
Siarhei Siamashka
