Re: rasops(9): revert changes in rasops32_putchar()?

Mateusz Guzik Sat, 23 Mar 2019 07:34:52 -0700

On 3/18/19, Frederic Cambus <f...@statdns.com> wrote:
> Hi tech@,
>
> Now that efifb(4) supports remapping the framebuffer in write combining
> mode, it's on par with inteldrm regarding display performance as for as
> rasops(9) is concerned.
>
> Therefore, I'm proposing reverting changes which were introduced in
> rasops32_putchar() in revision 1.8 [1] as they are now also slowing
> things down on efifb(4).
>
> I used a 6.3M text file for testing display performance:
> ftp https://norvig.com/big.txt
>
> The inteldrm and efifb tests were done on the same machine, the
> radeondrm ones were done on another (faster) machine.
>
> Screen size: 1920x1080
> Font size: 16x32
>
> Here are the results (3 runs each) of running: time cat big.txt
>
> inteldrm:
>
>     2m39.52s real     0m00.00s user     2m39.52s system
>     2m39.74s real     0m00.00s user     2m39.84s system
>     2m39.74s real     0m00.00s user     2m39.77s system
>
> inteldrm (with revert diff applied):
>
>     1m37m76s real     0m00.00s user     1m37m60s system
>     1m37m80s real     0m00.00s user     1m37m56s system
>     1m37m43s real     0m00.00s user     1m37m47s system
>
> efifb:
>
>     2m40.46s real     0m00.00s user     2m39.43s system
>     2m39.49s real     0m00.00s user     2m39.52s system
>     2m39.45s real     0m00.00s user     2m39.48s system
>
> efifb (with revert diff applied):
>
>     1m37m66s real     0m00.00s user     1m37m19s system
>     1m37m17s real     0m00.00s user     1m37m22s system
>     1m37m15s real     0m00.00s user     1m37m20s system
>
> radeondrm:
>
>     4m40.75s real     0m00.00s user     4m39m75s system
>     4m39.84s real     0m00.00s user     4m39m85s system
>     4m39.68s real     0m00.00s user     4m39m71s system
>
> radeondrm (with revert diff applied):
>
>     0m21.08s real     0m00.00s user     0m21.08s system
>     0m21.15s real     0m00.00s user     0m21.05s system
>     0m21.10s real     0m00.00s user     0m21.06s system
>
> [1]
> http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/dev/rasops/rasops32.c.diff?r1=1.7&r2=1.8
>
> Comments? OK?
>
> Index: sys/dev/rasops/rasops32.c
> ===================================================================
> RCS file: /cvs/src/sys/dev/rasops/rasops32.c,v
> retrieving revision 1.8
> diff -u -p -r1.8 rasops32.c
> --- sys/dev/rasops/rasops32.c 20 Feb 2017 15:35:05 -0000      1.8
> +++ sys/dev/rasops/rasops32.c 18 Mar 2019 08:15:18 -0000
> @@ -69,7 +69,6 @@ rasops32_putchar(void *cookie, int row,
>       struct rasops_info *ri;
>       int32_t *dp, *rp;
>       u_char *fr;
> -     uint32_t buffer[64];
>
>       ri = (struct rasops_info *)cookie;
>
> @@ -91,13 +90,12 @@ rasops32_putchar(void *cookie, int row,
>       clr[1] = ri->ri_devcmap[(attr >> 24) & 0xf];
>
>       if (uc == ' ') {
> -             for (cnt = 0; cnt < width; cnt++)
> -                     buffer[cnt] = clr[0];
>               while (height--) {
>                       dp = rp;
>                       DELTA(rp, ri->ri_stride, int32_t *);
>
> -                     memcpy(dp, buffer, width << 2);
> +                     for (cnt = width; cnt; cnt--)
> +                             *dp++ = clr[0];
>               }
>       } else {
>               uc -= ri->ri_font->firstchar;
> @@ -111,11 +109,10 @@ rasops32_putchar(void *cookie, int row,
>                       fr += fs;
>                       DELTA(rp, ri->ri_stride, int32_t *);
>
> -                     for (cnt = 0; cnt < width; cnt++) {
> -                             buffer[cnt] = clr[(fb >> 31) & 1];
> +                     for (cnt = width; cnt; cnt--) {
> +                             *dp++ = clr[(fb >> 31) & 1];
>                               fb <<= 1;
>                       }
> -                     memcpy(dp, buffer, width << 2);
>               }
>       }
>
>
>


So the majority of the win stems from not doing memcpy, but this is only
true because the current implementation is heavily pessimized by always
doing rep movsq + rep movsb which is particular harmful performance-wise
for small sizes.

FreeBSD also had this problem and several changes were made to remedy it,
see in particular:

https://reviews.freebsd.org/D17398
https://reviews.freebsd.org/D17661

Apart from that other important routines were also updated (copyin,
copyout, copyinstr and so on). Note this code is still not the fastest it
can be, but it's mostly diminishing returns from here.

Also note userspace code could use SIMD. I plan to import variants from
bionic (BSD-licensed).

See https://svnweb.freebsd.org/base/head/sys/amd64/amd64/support.S .

You may notice pagezero uses rep. Background page zeroing was removed
quite some time ago. I see OpenBSD still has it, but it is most likely
detrimental by now. If it is to be kept, a pagezero variant utilizing
non-temporal stores still makes sense but a different function should be
created for on-demand zeroing.

That said, it should be easy to lift a lot of this code. Could you please
benchmark with memcpy as implemented above? Not saying this particular
patch is wrong, but that the bigger problem should be addressed first.

Cheers,
-- 
Mateusz Guzik <mjguzik gmail.com>

Re: rasops(9): revert changes in rasops32_putchar()?

Reply via email to