> From: Mateusz Guzik <mjgu...@gmail.com> > Date: Sat, 23 Mar 2019 15:29:39 +0100 > > On 3/18/19, Frederic Cambus <f...@statdns.com> wrote: > > Hi tech@, > > > > Now that efifb(4) supports remapping the framebuffer in write combining > > mode, it's on par with inteldrm regarding display performance as for as > > rasops(9) is concerned. > > > > Therefore, I'm proposing reverting changes which were introduced in > > rasops32_putchar() in revision 1.8 [1] as they are now also slowing > > things down on efifb(4). > > > > I used a 6.3M text file for testing display performance: > > ftp https://norvig.com/big.txt > > > > The inteldrm and efifb tests were done on the same machine, the > > radeondrm ones were done on another (faster) machine. > > > > Screen size: 1920x1080 > > Font size: 16x32 > > > > Here are the results (3 runs each) of running: time cat big.txt > > > > inteldrm: > > > > 2m39.52s real 0m00.00s user 2m39.52s system > > 2m39.74s real 0m00.00s user 2m39.84s system > > 2m39.74s real 0m00.00s user 2m39.77s system > > > > inteldrm (with revert diff applied): > > > > 1m37m76s real 0m00.00s user 1m37m60s system > > 1m37m80s real 0m00.00s user 1m37m56s system > > 1m37m43s real 0m00.00s user 1m37m47s system > > > > efifb: > > > > 2m40.46s real 0m00.00s user 2m39.43s system > > 2m39.49s real 0m00.00s user 2m39.52s system > > 2m39.45s real 0m00.00s user 2m39.48s system > > > > efifb (with revert diff applied): > > > > 1m37m66s real 0m00.00s user 1m37m19s system > > 1m37m17s real 0m00.00s user 1m37m22s system > > 1m37m15s real 0m00.00s user 1m37m20s system > > > > radeondrm: > > > > 4m40.75s real 0m00.00s user 4m39m75s system > > 4m39.84s real 0m00.00s user 4m39m85s system > > 4m39.68s real 0m00.00s user 4m39m71s system > > > > radeondrm (with revert diff applied): > > > > 0m21.08s real 0m00.00s user 0m21.08s system > > 0m21.15s real 0m00.00s user 0m21.05s system > > 0m21.10s real 0m00.00s user 0m21.06s system > > > > [1] > > http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/dev/rasops/rasops32.c.diff?r1=1.7&r2=1.8 > > > > Comments? OK? > > > > Index: sys/dev/rasops/rasops32.c > > =================================================================== > > RCS file: /cvs/src/sys/dev/rasops/rasops32.c,v > > retrieving revision 1.8 > > diff -u -p -r1.8 rasops32.c > > --- sys/dev/rasops/rasops32.c 20 Feb 2017 15:35:05 -0000 1.8 > > +++ sys/dev/rasops/rasops32.c 18 Mar 2019 08:15:18 -0000 > > @@ -69,7 +69,6 @@ rasops32_putchar(void *cookie, int row, > > struct rasops_info *ri; > > int32_t *dp, *rp; > > u_char *fr; > > - uint32_t buffer[64]; > > > > ri = (struct rasops_info *)cookie; > > > > @@ -91,13 +90,12 @@ rasops32_putchar(void *cookie, int row, > > clr[1] = ri->ri_devcmap[(attr >> 24) & 0xf]; > > > > if (uc == ' ') { > > - for (cnt = 0; cnt < width; cnt++) > > - buffer[cnt] = clr[0]; > > while (height--) { > > dp = rp; > > DELTA(rp, ri->ri_stride, int32_t *); > > > > - memcpy(dp, buffer, width << 2); > > + for (cnt = width; cnt; cnt--) > > + *dp++ = clr[0]; > > } > > } else { > > uc -= ri->ri_font->firstchar; > > @@ -111,11 +109,10 @@ rasops32_putchar(void *cookie, int row, > > fr += fs; > > DELTA(rp, ri->ri_stride, int32_t *); > > > > - for (cnt = 0; cnt < width; cnt++) { > > - buffer[cnt] = clr[(fb >> 31) & 1]; > > + for (cnt = width; cnt; cnt--) { > > + *dp++ = clr[(fb >> 31) & 1]; > > fb <<= 1; > > } > > - memcpy(dp, buffer, width << 2); > > } > > } > > > > > > > > So the majority of the win stems from not doing memcpy, but this is only > true because the current implementation is heavily pessimized by always > doing rep movsq + rep movsb which is particular harmful performance-wise > for small sizes. > > FreeBSD also had this problem and several changes were made to remedy it, > see in particular: > > https://reviews.freebsd.org/D17398 > https://reviews.freebsd.org/D17661 > > Apart from that other important routines were also updated (copyin, > copyout, copyinstr and so on). Note this code is still not the fastest it > can be, but it's mostly diminishing returns from here. > > Also note userspace code could use SIMD. I plan to import variants from > bionic (BSD-licensed). > > See https://svnweb.freebsd.org/base/head/sys/amd64/amd64/support.S . > > You may notice pagezero uses rep. Background page zeroing was removed > quite some time ago. I see OpenBSD still has it, but it is most likely > detrimental by now. If it is to be kept, a pagezero variant utilizing > non-temporal stores still makes sense but a different function should be > created for on-demand zeroing. > > That said, it should be easy to lift a lot of this code. Could you please > benchmark with memcpy as implemented above? Not saying this particular > patch is wrong, but that the bigger problem should be addressed first.
Valid points. However, I'd argue that using memcpy here is actually wrong because a framebuffer may not be "normal" memory. An optimized memcpy might be slower, or worse, introduce artifacts because of limitations of the underlying bus protocols (this happens on some ARM designs).