On Thu, Sep 10, 2015 at 7:16 PM, Siarhei Siamashka <[email protected]> wrote:
>
> On Thu, 10 Sep 2015 12:27:18 +0300
> Oded Gabbay <[email protected]> wrote:
>
> > On Sat, Sep 5, 2015 at 10:03 PM, Oded Gabbay <[email protected]> wrote:
> > >
> > > On Fri, Sep 4, 2015 at 3:39 PM, Siarhei Siamashka
> > > <[email protected]> wrote:
> > > > Running "lowlevel-blt-bench over_n_8888" on Playstation3 3.2GHz,
> > > > Gentoo ppc (32-bit userland) gave the following results:
> > > >
> > > >     before: over_n_8888 = L1: 147.47  L2: 205.86  M: 121.07
> > > >     after:  over_n_8888 = L1: 287.27  L2: 261.09  M: 133.48
> > > >
> > > > Signed-off-by: Siarhei Siamashka <[email protected]>
> > > > ---
> > > >  pixman/pixman-vmx.c |   54 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  1 files changed, 54 insertions(+), 0 deletions(-)
> > > >
> > > > diff --git a/pixman/pixman-vmx.c b/pixman/pixman-vmx.c
> > > > index a9bd024..9e551b3 100644
> > > > --- a/pixman/pixman-vmx.c
> > > > +++ b/pixman/pixman-vmx.c
> > > > @@ -2745,6 +2745,58 @@ vmx_composite_src_x888_8888 (pixman_implementation_t *imp,
> > > >  }
> > > >
> > > >  static void
> > > > +vmx_composite_over_n_8888 (pixman_implementation_t *imp,
> > > > +                           pixman_composite_info_t *info)
> > > > +{
> > > > +    PIXMAN_COMPOSITE_ARGS (info);
> > > > +    uint32_t *dst_line, *dst;
> > > > +    uint32_t src, ia;
> > > > +    int i, w, dst_stride;
> > > > +    vector unsigned int vdst, vsrc, via;
> > > > +
> > > > +    src = _pixman_image_get_solid (imp, src_image, dest_image->bits.format);
> > > > +
> > > > +    if (src == 0)
> > > > +        return;
> > > > +
> > > > +    PIXMAN_IMAGE_GET_LINE (
> > > > +        dest_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
> > > > +
> > > > +    vsrc = (vector unsigned int){src, src, src, src};
> > > > +    via = negate (splat_alpha (vsrc));
> > >
> > > If we use the over function (see my next comment), we need to remove
> > > the negate () from the statement above, as it is already done inside
> > > the over function.
> > >
> > > > +    ia = ALPHA_8 (~src);
> > > > +
> > > > +    while (height--)
> > > > +    {
> > > > +        dst = dst_line;
> > > > +        dst_line += dst_stride;
> > > > +        w = width;
> > > > +
> > > > +        while (w && ((uintptr_t)dst & 15))
> > > > +        {
> > > > +            uint32_t d = *dst;
> > > > +            UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> > > > +            *dst++ = d;
> > > > +            w--;
> > > > +        }
> > > > +
> > > > +        for (i = w / 4; i > 0; i--)
> > > > +        {
> > > > +            vdst = pix_multiply (load_128_aligned (dst), via);
> > > > +            save_128_aligned (dst, pix_add (vsrc, vdst));
> > >
> > > Instead of the above two lines, I would simply use the over function
> > > in vmx, which does exactly that. So:
> > >
> > >     vdst = over (vsrc, via, load_128_aligned (dst));
> > >     save_128_aligned (dst, vdst);
> > >
> > > I prefer this as it reuses an existing function, which helps
> > > maintainability, and using it has no impact on performance.
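To spell out that suggestion: here is a rough, untested sketch of the
setup and the aligned loop using the existing over () helper from
pixman-vmx.c (over () multiplies dest by negate (srca) internally,
which is why via is built without the negate () here):

    /* sketch only: over (src, srca, dest) computes
     * src + dest * (255 - srca) per channel, with a saturating add */
    vsrc = (vector unsigned int){src, src, src, src};
    via = splat_alpha (vsrc);   /* no negate (); over () does it itself */

    ...

    for (i = w / 4; i > 0; i--)
    {
        save_128_aligned (dst, over (vsrc, via, load_128_aligned (dst)));
        dst += 4;
    }

The scalar head/tail loops can stay as they are: ia = ALPHA_8 (~src)
already matches per pixel what over () does per vector.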
> > > >
> > > > +            dst += 4;
> > > > +        }
> > > > +
> > > > +        for (i = w % 4; --i >= 0;)
> > > > +        {
> > > > +            uint32_t d = dst[i];
> > > > +            UN8x4_MUL_UN8_ADD_UN8x4 (d, ia, src);
> > > > +            dst[i] = d;
> > > > +        }
> > > > +    }
> > > > +}
> > > > +
> > > > +static void
> > > >  vmx_composite_over_8888_8888 (pixman_implementation_t *imp,
> > > >                                pixman_composite_info_t *info)
> > > >  {
> > > > @@ -3079,6 +3131,8 @@ FAST_NEAREST_MAINLOOP (vmx_8888_8888_normal_OVER,
> > > >
> > > >  static const pixman_fast_path_t vmx_fast_paths[] =
> > > >  {
> > > > +    PIXMAN_STD_FAST_PATH (OVER, solid, null, a8r8g8b8, vmx_composite_over_n_8888),
> > > > +    PIXMAN_STD_FAST_PATH (OVER, solid, null, x8r8g8b8, vmx_composite_over_n_8888),
> > > >      PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, a8r8g8b8, vmx_composite_over_8888_8888),
> > > >      PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, vmx_composite_over_8888_8888),
> > > >      PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, vmx_composite_over_8888_8888),
> > > > --
> > > > 1.7.8.6
> > > >
> > >
> > > Indeed, this implementation is much better than what I did.
> > > Apparently, converting the sse2 code to vmx calls one-to-one is not
> > > the optimal way. On my POWER8 machine, I get:
> > >
> > > reference memcpy speed = 24764.8MB/s (6191.2MP/s for 32bpp fills)
> > >
> > >              before      after     change
> > >     L1       572.29    1539.47   +169.00%
> > >     L2      1038.08    1549.04    +49.22%
> > >     M       1104.10    1522.22    +37.87%
> > >     HT       447.45     676.32    +51.15%
> > >     VT       520.82     764.82    +46.85%
> > >     R        407.92     570.54    +39.87%
> > >     RT       148.90     208.77    +40.21%
> > >     Kops/s  1100       1418       +28.91%
> > >
> > > So, assuming the change above, this patch is:
> > >
> > > Reviewed-by: Oded Gabbay <[email protected]>
> >
> > Hi Siarhei,
>
> Hi,
>
> > After I fixed my cairo setup (see
> > http://lists.freedesktop.org/archives/pixman/2015-September/003987.html),
>
> Interesting. How did it happen to be wrong in the first place? Is there
> anything missing or incorrect in the test scripts or usage instructions?
>
> > I went and re-tested your patch with the cairo trimmed benchmark
> > against current pixman master.
> > Unfortunately, it gives a minor slowdown:
> >
> > Slowdowns
> > =========
> >  t-firefox-scrolling  1232.30 -> 1295.75 :  1.05x slowdown
> >
> > Even if I apply your patch on top of my latest patch set (which was
> > inspired by your patch), I still get a slowdown, albeit in a
> > different trace:
> >
> > Slowdowns
> > =========
> >  t-firefox-asteroids  440.01 -> 469.68 :  1.07x slowdown
> >
> > What's your take on this?
>
> Are these results consistently reproducible across multiple runs?
> You can also try to set the cut-off threshold to 1% instead of the
> default 5% in the cairo-perf-diff-files tool:
>
>     ./cairo-perf-diff-files --min-change 1% old.txt new.txt
>
> Anyway, this looks like the measurement accuracy may not be very good.
> As the readme at https://github.com/ssvb/trimmed-cairo-traces says, the
> trimmed traces have been trimmed to run faster on very low end hardware
> (such as the Raspberry Pi); otherwise they would take many hours to
> complete. And if you are running benchmarks on a high end POWER8 box,
> then it probably makes sense to try the original traces from:
>
>     http://cgit.freedesktop.org/cairo-traces/tree/benchmark
>
> I guess right now the reported times are probably in the ballpark
> of a fraction of a second for each test on your hardware.
>
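For reference, such a before/after comparison looks roughly like this
(the trace name, iteration count, and output file names below are only
illustrative, and the -r/-i flags are an assumption about cairo's perf
harness; adjust to your local setup):

    # replay a trace against the unpatched pixman, keeping raw samples
    ./cairo-perf-trace -r -i 5 benchmark/firefox-asteroids.trace > old.txt
    # rebuild pixman with the patch applied, then repeat the same run
    ./cairo-perf-trace -r -i 5 benchmark/firefox-asteroids.trace > new.txt
    # compare the two runs with a 1% cut-off instead of the default 5%
    ./cairo-perf-diff-files --min-change 1% old.txt new.txt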
I took your advice and ran the benchmarks with the *non*-trimmed traces.
A full benchmark run takes 144 minutes on POWER8, 3.4GHz, 8 cores
(bare metal, not a VM).

I compared with and without this patch and I got:

 image  ocitysmap            659.69 ->  611.71 :  1.08x speedup
 image  xfce4-terminal-a1   2725.22 -> 2547.47 :  1.07x speedup

So I guess we can merge this patch, because I prefer the non-trimmed
results over the trimmed ones. The significant improvement in
lowlevel-blt-bench is also a good sign.

Therefore:

Reviewed-by: Oded Gabbay <[email protected]>

	Oded

> Is the system really undisturbed during the test? You can also try to
> pin the execution to a single core via "taskset" and check if this
> changes anything.
>
> This whole issue definitely needs some investigation.
>
> Profiling the t-firefox-scrolling trace replay on my Playstation3:
>
> # perf report
>     32.10%  cairo-perf-trac  [kernel.kallsyms]       [k] .raw_local_irq_restore
>     31.77%  cairo-perf-trac  libc-2.20.so            [.] _wordcopy_fwd_aligned
>     16.01%  cairo-perf-trac  libpixman-1.so.0.33.3   [.] vmx_composite_over_n_8888_8888_ca
>      1.80%  cairo-perf-trac  [kernel.kallsyms]       [k] .handle_mm_fault
>      0.74%  cairo-perf-trac  [kernel.kallsyms]       [k] .unmap_vmas
>      0.68%  cairo-perf-trac  [kernel.kallsyms]       [k] .do_page_fault
>      0.58%  cairo-perf-trac  libcairo.so.2.11200.12  [.] _cairo_scaled_font_glyph_device_extents
>      0.52%  cairo-perf-trac  [kernel.kallsyms]       [k] .__alloc_pages_nodemask
>      0.49%  cairo-perf-trac  [kernel.kallsyms]       [k] .do_raw_spin_lock
>      0.46%  cairo-perf-trac  libpixman-1.so.0.33.3   [.] pixman_composite_glyphs_no_mask
>      0.44%  cairo-perf-trac  [kernel.kallsyms]       [k] .get_page_from_freelist
>      0.42%  cairo-perf-trac  [kernel.kallsyms]       [k] .page_remove_rmap
>
> # perf report -d libpixman-1.so.0.33.3
>     87.37%  cairo-perf-trac  [.] vmx_composite_over_n_8888_8888_ca
>      2.53%  cairo-perf-trac  [.] pixman_composite_glyphs_no_mask
>      1.98%  cairo-perf-trac  [.] vmx_combine_over_u_no_mask
>      1.74%  cairo-perf-trac  [.] bits_image_fetch_bilinear_affine_pad_x8r8g8b8
>      1.57%  cairo-perf-trac  [.] vmx_fill
>      1.04%  cairo-perf-trac  [.] lookup_glyph
>      0.53%  cairo-perf-trac  [.] _pixman_image_get_solid
>      0.42%  cairo-perf-trac  [.] pixman_region32_rectangles
>      0.39%  cairo-perf-trac  [.] fast_composite_src_memcpy
>      0.30%  cairo-perf-trac  [.] hash
>      0.25%  cairo-perf-trac  [.] 00008000.got2.plt_pic32.memcpy@@GLIBC_2.0
>      0.23%  cairo-perf-trac  [.] pixman_glyph_cache_lookup
>      0.19%  cairo-perf-trac  [.] _pixman_image_validate
>      0.12%  cairo-perf-trac  [.] 00008000.got2.plt_pic32.pixman_region32_rectangles
>      0.11%  cairo-perf-trac  [.] pixman_image_create_solid_fill
>      0.10%  cairo-perf-trac  [.] vmx_combine_add_u_no_mask
>      0.10%  cairo-perf-trac  [.] pixman_unorm_to_float
>
> The vmx_composite_over_n_8888 fast path is not expected to make any
> measurable contribution to the results. The execution time is mostly
> dominated by memcpy from glibc and vmx_composite_over_n_8888_8888_ca
> from pixman.
>
> --
> Best regards,
> Siarhei Siamashka

_______________________________________________
Pixman mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/pixman
