On 3/18/19, Frederic Cambus <f...@statdns.com> wrote: > Hi tech@, > > Now that efifb(4) supports remapping the framebuffer in write combining > mode, it's on par with inteldrm regarding display performance as for as > rasops(9) is concerned. > > Therefore, I'm proposing reverting changes which were introduced in > rasops32_putchar() in revision 1.8 [1] as they are now also slowing > things down on efifb(4). > > I used a 6.3M text file for testing display performance: > ftp https://norvig.com/big.txt > > The inteldrm and efifb tests were done on the same machine, the > radeondrm ones were done on another (faster) machine. > > Screen size: 1920x1080 > Font size: 16x32 > > Here are the results (3 runs each) of running: time cat big.txt > > inteldrm: > > 2m39.52s real 0m00.00s user 2m39.52s system > 2m39.74s real 0m00.00s user 2m39.84s system > 2m39.74s real 0m00.00s user 2m39.77s system > > inteldrm (with revert diff applied): > > 1m37m76s real 0m00.00s user 1m37m60s system > 1m37m80s real 0m00.00s user 1m37m56s system > 1m37m43s real 0m00.00s user 1m37m47s system > > efifb: > > 2m40.46s real 0m00.00s user 2m39.43s system > 2m39.49s real 0m00.00s user 2m39.52s system > 2m39.45s real 0m00.00s user 2m39.48s system > > efifb (with revert diff applied): > > 1m37m66s real 0m00.00s user 1m37m19s system > 1m37m17s real 0m00.00s user 1m37m22s system > 1m37m15s real 0m00.00s user 1m37m20s system > > radeondrm: > > 4m40.75s real 0m00.00s user 4m39m75s system > 4m39.84s real 0m00.00s user 4m39m85s system > 4m39.68s real 0m00.00s user 4m39m71s system > > radeondrm (with revert diff applied): > > 0m21.08s real 0m00.00s user 0m21.08s system > 0m21.15s real 0m00.00s user 0m21.05s system > 0m21.10s real 0m00.00s user 0m21.06s system > > [1] > http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/dev/rasops/rasops32.c.diff?r1=1.7&r2=1.8 > > Comments? OK? > > Index: sys/dev/rasops/rasops32.c > =================================================================== > RCS file: /cvs/src/sys/dev/rasops/rasops32.c,v > retrieving revision 1.8 > diff -u -p -r1.8 rasops32.c > --- sys/dev/rasops/rasops32.c 20 Feb 2017 15:35:05 -0000 1.8 > +++ sys/dev/rasops/rasops32.c 18 Mar 2019 08:15:18 -0000 > @@ -69,7 +69,6 @@ rasops32_putchar(void *cookie, int row, > struct rasops_info *ri; > int32_t *dp, *rp; > u_char *fr; > - uint32_t buffer[64]; > > ri = (struct rasops_info *)cookie; > > @@ -91,13 +90,12 @@ rasops32_putchar(void *cookie, int row, > clr[1] = ri->ri_devcmap[(attr >> 24) & 0xf]; > > if (uc == ' ') { > - for (cnt = 0; cnt < width; cnt++) > - buffer[cnt] = clr[0]; > while (height--) { > dp = rp; > DELTA(rp, ri->ri_stride, int32_t *); > > - memcpy(dp, buffer, width << 2); > + for (cnt = width; cnt; cnt--) > + *dp++ = clr[0]; > } > } else { > uc -= ri->ri_font->firstchar; > @@ -111,11 +109,10 @@ rasops32_putchar(void *cookie, int row, > fr += fs; > DELTA(rp, ri->ri_stride, int32_t *); > > - for (cnt = 0; cnt < width; cnt++) { > - buffer[cnt] = clr[(fb >> 31) & 1]; > + for (cnt = width; cnt; cnt--) { > + *dp++ = clr[(fb >> 31) & 1]; > fb <<= 1; > } > - memcpy(dp, buffer, width << 2); > } > } > > >
So the majority of the win stems from not doing memcpy, but this is only true because the current implementation is heavily pessimized by always doing rep movsq + rep movsb which is particular harmful performance-wise for small sizes. FreeBSD also had this problem and several changes were made to remedy it, see in particular: https://reviews.freebsd.org/D17398 https://reviews.freebsd.org/D17661 Apart from that other important routines were also updated (copyin, copyout, copyinstr and so on). Note this code is still not the fastest it can be, but it's mostly diminishing returns from here. Also note userspace code could use SIMD. I plan to import variants from bionic (BSD-licensed). See https://svnweb.freebsd.org/base/head/sys/amd64/amd64/support.S . You may notice pagezero uses rep. Background page zeroing was removed quite some time ago. I see OpenBSD still has it, but it is most likely detrimental by now. If it is to be kept, a pagezero variant utilizing non-temporal stores still makes sense but a different function should be created for on-demand zeroing. That said, it should be easy to lift a lot of this code. Could you please benchmark with memcpy as implemented above? Not saying this particular patch is wrong, but that the bigger problem should be addressed first. Cheers, -- Mateusz Guzik <mjguzik gmail.com>