On Mon, 24 Aug 2015 15:20:22 +0100, Oded Gabbay <[email protected]> wrote:
I tested the patch on POWER8, ppc64le.
make check passes, but when I benchmarked the performance using
lowlevel-blt-bench over_n_8888, I got far worse results than without
this patch (see numbers below).
Apparently, the new C fast-path takes precedence over some vmx combine
fast-paths, thus making performance worse instead of better.

If that's true, it wouldn't be the first time that this issue has arisen.
Pixman is designed to scan all fast paths (i.e. routines which take
source, mask and destination images and perform the whole operation
themselves) irrespective of how well tuned they are for a given platform,
before it starts looking at iter routines (each of which reads or writes
only one of source, mask or destination).

Previously, in response to my attempt to work around a case where this
was happening, Søren wrote:
[...] the longstanding problem that pixman sometimes won't
pick the fastest code for some operations. In this case the general
iter based code will be faster then the C code in pixman-fast-path.c
because the iter code will use assembly fetchers.
As a result you end up with a bunch of partial reimplementations of
general_composite_rect() inside pixman-arm-simd.c.
Maybe we need to admit failure and make general_composite_rect()
available for other implementations to use. Essentially we would
officially provide a way for implementations to say: My iterators are
faster than pixman-fast-path.c for these specific operations, so just
go directly to general_composite_rect().
It's definitely not a pleasant thing to do, but given that nobody is
likely to have the time/skill combination required to do the necessary
redesign of pixman's fast path system, it might still be preferable to
to do this instead of duplicating the code like these patches do.
With a setup like that, we could also fix the same issue for the
bilinear SSSE3 fetchers and possibly other cases.

(ref http://lists.freedesktop.org/archives/pixman/2014-October/003457.html)

I can't say that any cleaner solution has occurred to me since then.

I just had a quick look at the VMX source file, and it has hardly any
iters defined. My guess would be that what's being used is

noop_init_solid_narrow() from pixman-noop.c
_pixman_iter_get_scanline_noop() from pixman-utils.c
combine_src_u() from pixman-combine32.c

this last one would be responsible for the the bulk of the runtime - and
it palms most of the work off on memcpy(). Presumably the PPC memcpy is
super-optimised?

Without the patch:

reference memcpy speed = 25711.7MB/s (6427.9MP/s for 32bpp fills)
---
over_n_8888: PIXMAN_OP_OVER, src a8r8g8b8 solid, mask null, dst a8r8g8b8
---
             over_n_8888 =  L1: 572.29  L2:1038.08  M:1104.10 (
17.18%)  HT:447.45  VT:520.82  R:407.92  RT:148.90 (1100Kops/s)

There's something a bit odd about that - it's slower when working within
the caches (especially the L1 cache) than when operating on main memory.
I'd hazard a guess that memcpy() is using some hardware acceleration that
lives between the L1 and L2 caches.

Presumably for patch 3 of this series (over_n_0565) you wouldn't see the
same effect, as that can't be achieved using mempcy().

Ben
_______________________________________________
Pixman mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/pixman

Reply via email to