On 8/2/19 10:54 AM, Pekka Paalanen wrote:
> On Thu, 1 Aug 2019 19:02:57 +0200 Martin Stransky wrote:
> Hi Martin,
>
> I'd like to ask if we can discuss technical topics in public,
> e.g. this would be on topic for the wayland-devel@ mailing list. The answers
> may benefit more people, and OTOH I don't know everything. :-)
>
> If you are fine with that, please copy this whole email to
> wayland-devel@ when you reply.
Hi Pekka,
sure, cc'ing the list. I'll ask directly there next time.
Ponderings below.
>> I'm implementing DMABuf backend HW buffers for Firefox and I wonder if
>> you can give me some advice regarding it, as I have some difficulties with
>> the $SUBJ.
>>
>> I implemented basic dmabuf rendering (allocate/create a dmabuf, draw into
>> it, bind it as a wl_buffer and send it to the compositor); that code lives at
>> https://searchfox.org/mozilla-central/source/widget/gtk/WaylandDMABufSurface.cpp
>> and seems to be working somehow. Now comes the difficult part - I need
>> to map the dmabuf to CPU memory, draw into it with Skia, and then send it to
>> a different process and make an EGLImage/GL texture from it there.
>>
>> What's the best way to do that? I tried gbm_bo_import() with
>> GBM_BO_IMPORT_FD (I used the fd which was returned from gbm_bo_get_fd()) but
>> that fails with "invalid argument" although all params seem to be sane.
>> Do I need to configure/export the gbm object somehow, and is that
>> supposed to work? Or shall I use DRM PRIME for it?
> Nowadays, all dmabufs *should* be mmappable for CPU access. You'd do it
> by just mmap() on the dmabuf fd. E.g. gbm_bo_get_fd() gives you a
> dmabuf fd. Of course, you'll have to ensure it's in linear format or
> you get to deal with tiling manually.
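For reference, that mapping step can be sketched as follows (a minimal sketch assuming a single-plane linear buffer; the fd, stride and height would come from gbm_bo_get_fd(), gbm_bo_get_stride() and gbm_bo_get_height(); error handling trimmed):

```c
/* Sketch: CPU-map a single-plane linear dmabuf given its fd, stride and
 * height. The fd is assumed to come from gbm_bo_get_fd(). */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* One linear plane occupies stride * height bytes. */
static size_t linear_map_len(uint32_t stride, uint32_t height)
{
    return (size_t)stride * height;
}

static void *map_dmabuf(int dmabuf_fd, uint32_t stride, uint32_t height,
                        size_t *len_out)
{
    size_t len = linear_map_len(stride, height);
    void *ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                     dmabuf_fd, 0);

    if (ptr == MAP_FAILED)
        return NULL;
    *len_out = len;
    return ptr;
}
```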
That was caused by a wrong fd which was malformed inside the Firefox
machinery; gbm_bo_import() works now.
> Note that some pixel formats or modifiers may imply multiple dmabufs
> per image, so make sure you don't hit those cases - if you limit to the
> usual RGBA formats and the linear modifier, you're fine. YUV less so.
Yes, that's my case.
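A cheap guard for that simple case could look like this (a sketch; plane count and modifier as returned by gbm_bo_get_plane_count() and gbm_bo_get_modifier(); DRM_FORMAT_MOD_LINEAR is spelled out as 0 so the sketch needs no libdrm header):

```c
#include <stdbool.h>
#include <stdint.h>

/* DRM_FORMAT_MOD_LINEAR from <drm_fourcc.h> is 0; repeated here so the
 * sketch does not depend on libdrm headers. */
#define LINEAR_MODIFIER 0ull

/* Accept only the easy case: one plane with the linear layout. Anything
 * else (multi-plane YUV, vendor tiling modifiers) needs more care. */
static bool dmabuf_is_simple(int plane_count, uint64_t modifier)
{
    return plane_count == 1 && modifier == LINEAR_MODIFIER;
}
```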
> An important detail is that you *must* use DMA_BUF_IOCTL_SYNC to
> bracket your actual CPU read/write sequences to the dmabuf. That ioctl
> will ensure the appropriate caches are flushed correctly (you might not
> notice anything wrong on x86, but on other hardware forgetting to do
> that can randomly result in bad data), and I think it also waits for
> implicit fences (e.g. if you had the GPU write to the dmabuf earlier, to
> ensure the operation finished).
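The bracketing could be sketched like this (fd is the dmabuf fd, ptr/len the mmap()ed view; the memset() merely stands in for the actual SW rendering):

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

static int dmabuf_sync(int fd, uint64_t flags)
{
    struct dma_buf_sync sync = { .flags = flags };

    return ioctl(fd, DMA_BUF_IOCTL_SYNC, &sync);
}

/* Bracket a CPU write sequence: START before touching the pixels,
 * END when done, so caches are flushed and implicit fences waited on. */
static void draw_into_dmabuf(int fd, void *ptr, size_t len)
{
    dmabuf_sync(fd, DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE);
    memset(ptr, 0xff, len);    /* the actual SW rendering goes here */
    dmabuf_sync(fd, DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE);
}
```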
> I'm not completely sure if all DRM drivers everywhere already support
> mmapping the dmabufs they created.
> The other catch is that "casual" CPU access could be extremely slow.
> Think about uncached memory through the PCI bus or something. That's why it
> is usually avoided as much as possible. Definitely a bad idea to do
> read-modify-write cycles to it (that is, blending) in general.
The main goal is to use it for video frame export & rendering in
Firefox, so there should not be any reads.
> It all depends on where you allocated the buffer, how the respective
> driver works, and how the hardware works. If it comes from a discrete
> GPU driver, the above caveats are likely. If it's an IGP, you might have
> an easier time. If it's the VGEM driver, I guess that could be nice. But
> one cannot really know until you benchmark the exact system you're
> running on.
>
> If you end up having to use a shadow buffer for the CPU rendering
> anyway, it might be best to let the OpenGL driver worry about getting
> the data to the GPU (glTexImage2D, or some extension that does not imply
> an extra CPU copy like glTexImage2D does).
>
> Another option may be to create an EGLImage on top of the buffer and send
> the EGLImage to the render process.
> Let's take a step back to see if I understand your use case correctly.
> You want to do software rendering into a buffer, pass that buffer to
> another process, and then have a GPU directly texture from the buffer.
> Is that correct?
Yes, that's correct. Software rendering into a buffer comes from the video
decoder in the content process, then the buffer is passed to the render
process, bound as a texture, and drawn/composited by EGL to the screen. Yes,
zero-copy is the ultimate goal here.
> If so, my knowledge of that is hazy. I'm not sure there even exists one
> zero-copy solution that is supposed to both work everywhere and be
> fairly performant. GPUs can be very picky about what they can texture
> from. You need to allocate the buffer to suit the GPU, but sometimes
> that is in direct conflict with wanting to have efficient CPU access to
> it.
>
> I'm not sure what to recommend.
Mozilla aims to support recent Intel drivers first, so it's fine if
that works on that subset of HW. Other ones can use the existing SW
rendering path which is used now.
> FWIW, I've been working on improving the performance of DisplayLink
> devices on Mutter. There is a stack of fallbacks for the zero-copy case,
> and I haven't finished implementing the zero-copy case yet. In all
> cases, the GPU is rendering the image, and the DisplayLink device (DL)
> is display-only and needs CPU access to the buffer as it is a virtual
> DRM driver.
>
> - zero copy by either allocating on the GPU and importing to DL, or
>   allocating on DL and importing to GPU;
> - GPU copy from a temporary GPU buffer into a DL buffer imported to GPU;
> - CPU copy (glReadPixels) from a temporary GPU buffer into a mmapped DL
>   buffer.
>
> The trouble is that each "import" may fail for whatever reason, and I
> must have a fallback.
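Such a fallback stack amounts to trying each path in order of preference until one's setup succeeds; a generic sketch (the stub path functions here are illustrative placeholders, not Mutter's actual code):

```c
#include <stdbool.h>
#include <stddef.h>

typedef bool (*buffer_path_fn)(void *ctx);

/* Illustrative stubs; real code would attempt a dmabuf import, GPU blit
 * setup, or mmap + glReadPixels setup here and report success. */
static bool path_fail(void *ctx) { (void)ctx; return false; }
static bool path_ok(void *ctx)   { (void)ctx; return true;  }

/* Try each buffer path in order of preference; return the index of the
 * first one whose setup succeeds, or -1 if all fail. */
static int pick_buffer_path(const buffer_path_fn *paths, size_t n, void *ctx)
{
    for (size_t i = 0; i < n; i++)
        if (paths[i](ctx))
            return (int)i;
    return -1;
}
```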
> My case is the opposite of your case: I have GPU writing and CPU
> reading, you have CPU writing and GPU reading.
>> Also I wonder if it's feasible to use any modifiers, as I need a
>> plain/linear buffer to draw into with Skia. I suspect that when I create the
>> buffer with modifiers and then map it to CPU memory for SW drawing, an
>> intermediate buffer is created and then the pixels are de-composed back
>> to GPU memory.
> No, I don't believe there is any kind of intermediate buffer behind the
> scenes in GBM or the dmabuf ioctls/mmap. OpenGL drivers may use
> intermediate buffers. Some EGLImage operations are allowed to create
> more buffers behind the scenes, but I think implementations try to avoid
> it. Copies are bad for performance, and implicit copies are unexpected
> performance bottlenecks. So yes, I believe you very much need to
> ensure the buffer gets allocated as linear from the start.
> Some hardware may have hardware tiling units that may be able to
> present a linear CPU view into a tiled buffer, but I know very little
> about that. I think they might need driver-specific ioctls to use or
> something, and they are a scarce resource.
>> Thanks,
>> ma.
> Thanks,
> pq
--
Martin Stransky
Software Engineer / Red Hat, Inc
_______________________________________________
wayland-devel mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/wayland-devel