On Tue, 25 Aug 2015 18:29:59 +0300 Oded Gabbay <[email protected]> wrote:
> On Tue, Aug 25, 2015 at 5:02 PM, Ben Avison <[email protected]> wrote:
> > On Tue, 25 Aug 2015 13:45:48 +0100, Oded Gabbay <[email protected]>
> > wrote:
> >>>
> >>> [exposing general_composite_rect]
> >>> I can't say that any cleaner solution has occurred to me since then.
> >>
> >>
> >> I think the more immediate solution, as Soren has suggested on IRC,
> >> is for me to implement the equivalent fast-path in VMX.
> >> I see that it is already implemented in mmx, sse2, mips-dspr2 and
> >> arm-neon. From looking at the C code, I'm guessing that it is fairly
> >> simple to implement.
> >
> >
> > Yes, it's definitely one of the simpler fast paths, with only two
> > channels to worry about (source and destination) and with one of them
> > being a constant. I wrote an arm-simd version as well, to add to your
> > list - it's just that it's still waiting to be committed :)
> >
> > I probably ought to get round to exposing general_composite_rect sooner
> > rather than later anyway - it's one of the few things from my mammoth
> > patch series last year that Søren commented on and which I haven't got
> > round to revising yet.
> >
> >>> I just had a quick look at the VMX source file, and it has hardly any
> >>> iters defined. My guess would be that what's being used is
> >>>
> >>>     noop_init_solid_narrow() from pixman-noop.c
> >>>     _pixman_iter_get_scanline_noop() from pixman-utils.c
> >>>     combine_src_u() from pixman-combine32.c
> >>>
> >> I ran perf on lowlevel-blt-bench over_n_8888 and what I got is:
> >>
> >> -  48.71%  48.68%  lowlevel-blt-be  lowlevel-blt-bench  [.] vmx_combine_over_u_no_mask
> >>    - vmx_combine_over_u_no_mask
> >
> >
> > Sorry, my mistake - for some reason I must have thought we were dealing
> > with src_n_8888 rather than over_n_8888. If you can beat the C version
> > using a solid fetcher (which fills a temporary buffer the size of the row
> > with a constant pixel value) and an optimised OVER combiner, then you
> > should be able to do better still if you cut out the temporary buffer and
> > keep the solid colour in registers.
> >
>
> I implemented over_n_8888 for vmx (adapted from sse2) and ran the
> lowlevel benchmark. I got degradation in almost all the benches (on
> POWER8, ppc64le):
>
> reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)
>
>   L1        572.29    676.6     +18.23%
>   L2       1038.08    672.68    -35.20%
>   M        1104.1     682.63    -38.17%
>   HT        447.45    269.15    -39.85%
>   VT        520.82    357.1     -31.44%
>   R         407.92    259.46    -36.39%
>   RT        148.9     100.25    -32.67%
>   Kops/s   1100       910       -17.27%
>
> so I'm not inclined to add this slow-path :)

If it is slower, then you are probably implementing it wrong :)

Please try http://patchwork.freedesktop.org/patch/58669/

-- 
Best regards,
Siarhei Siamashka
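
For readers following the thread, the difference Ben describes between "solid fetcher plus OVER combiner" and "keep the solid colour in registers" can be sketched in plain scalar C. This is only an illustration, not pixman's code and not the VMX patch: the names mul_un8x4, over_n_8888_two_pass, over_n_8888_one_pass and MAX_ROW are hypothetical, pixels are assumed to be premultiplied a8r8g8b8, and the plain addition relies on that premultiplied assumption to avoid channel overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { MAX_ROW = 1024 };  /* hypothetical row limit for the temporary buffer */

/* Multiply each 8-bit channel of x by a, dividing by 255 with rounding.
 * Two channels are processed at a time in the halves of a 32-bit word. */
static uint32_t mul_un8x4(uint32_t x, uint8_t a)
{
    uint32_t rb = (x & 0x00ff00ff) * a + 0x00800080;
    uint32_t ag = ((x >> 8) & 0x00ff00ff) * a + 0x00800080;

    rb = ((rb + ((rb >> 8) & 0x00ff00ff)) >> 8) & 0x00ff00ff;
    ag = (ag + ((ag >> 8) & 0x00ff00ff)) & 0xff00ff00;
    return rb | ag;
}

/* Generic route: a "solid fetcher" fills a row-sized temporary buffer with
 * the constant source, then an OVER combiner reads it back pixel by pixel. */
static void over_n_8888_two_pass(uint32_t *dst, size_t width, uint32_t src)
{
    uint32_t tmp[MAX_ROW];
    size_t i;

    assert(width <= MAX_ROW);
    for (i = 0; i < width; i++)          /* pass 1: fetch solid source */
        tmp[i] = src;
    for (i = 0; i < width; i++) {        /* pass 2: combine OVER */
        uint8_t ialpha = (uint8_t)(255 - (tmp[i] >> 24));
        dst[i] = tmp[i] + mul_un8x4(dst[i], ialpha);
    }
}

/* Fast-path route: no temporary buffer; the source colour and its inverse
 * alpha are computed once and stay in registers for the whole row. */
static void over_n_8888_one_pass(uint32_t *dst, size_t width, uint32_t src)
{
    const uint8_t ialpha = (uint8_t)(255 - (src >> 24));
    size_t i;

    for (i = 0; i < width; i++)
        dst[i] = src + mul_un8x4(dst[i], ialpha);
}
```

Both functions produce the same pixels; the second simply avoids writing and re-reading a row of identical values, which is the saving a dedicated over_n_8888 fast path is meant to capture.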
