Re: Does gbm_bo_map() implicitly synchronise?
On 15/06/2024 13:35, Marek Olšák wrote:
> It's probably driver-specific. Some drivers might need glFlush before you use gbm_bo_map because gbm might only wait for work that has been flushed.

That would be needed on the "writing" side, right? So if I'm seeing issues when mapping for reading, then it would indicate a bug in the other peer? Which would be gnome-shell in my case.

Any way I could test this? Can I force extra syncs/flushes in some way and see if the issue goes away?

I tried adding a sleep of 10 ms before reading the data, but did not see any improvement. Which would make sense if the commands are still sitting in an application buffer somewhere, rather than with the GPU.

Regards,
Pierre Ossman
Re: Does gbm_bo_map() implicitly synchronise?
On 17.06.24 09:32, Pierre Ossman wrote:
> On 15/06/2024 13:35, Marek Olšák wrote:
>> It's probably driver-specific. Some drivers might need glFlush before you use gbm_bo_map because gbm might only wait for work that has been flushed.
>
> That would be needed on the "writing" side, right? So if I'm seeing issues when mapping for reading, then it would indicate a bug in the other peer? Which would be gnome-shell in my case.
>
> Any way I could test this? Can I force extra syncs/flushes in some way and see if the issue goes away?

Well, the primary question here is what you want to wait for. As Marek wrote, GBM and the kernel can only see work which has been flushed, not work which is still queued up inside the OpenGL library, for example.

> I tried adding a sleep of 10 ms before reading the data, but did not see any improvement. Which would make sense if the commands are still sitting in an application buffer somewhere, rather than with the GPU.

Let me try to clarify a couple of things:

The DMA_BUF_IOCTL_SYNC ioctl flushes and invalidates caches so that the GPU can see values written by the CPU and the CPU can see values written by the GPU. But that ioctl does *not* wait for any async GPU operation to finish.

If you want to wait for async GPU operations, you either need to call the OpenGL functions to read back pixels, or do a select() (or poll, epoll, etc.) call on the DMA-buf file descriptor.

So if you want to do some rendering with OpenGL and then see the result in a buffer memory mapping, the correct sequence would be the following:

1. Issue OpenGL rendering commands.
2. Call glFlush() to make sure the hardware actually starts working on the rendering.
3. Call select() on the DMA-buf file descriptor to wait for the rendering to complete.
4. Use DMA_BUF_IOCTL_SYNC to make the rendering result CPU visible.

Regards,
Christian.
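For illustration, a minimal C sketch of steps 2-4 of the sequence described above, assuming a current OpenGL context and an already-exported DMA-buf file descriptor; the helper name and the omission of error handling are illustrative only, not taken from any actual implementation:

#include <sys/select.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>
#include <GL/gl.h>

/* Flush the GL work, wait for the GPU to finish writing the DMA-buf, then
 * make the result visible to CPU reads of the buffer's memory mapping. */
static void wait_for_render_then_begin_cpu_read(int dmabuf_fd)
{
    /* Step 2: make sure the hardware actually starts on the queued commands. */
    glFlush();

    /* Step 3: a DMA-buf fd reports readable once the pending GPU writes have
     * signalled their fences, so select() for readability waits for them. */
    fd_set fds;
    FD_ZERO(&fds);
    FD_SET(dmabuf_fd, &fds);
    select(dmabuf_fd + 1, &fds, NULL, NULL, NULL);

    /* Step 4: flush/invalidate CPU caches so the CPU sees the GPU's writes.
     * After reading, the caller should issue the matching
     * DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ ioctl. */
    struct dma_buf_sync sync = {
        .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ,
    };
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
}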
Re: Does gbm_bo_map() implicitly synchronise?
On 17/06/2024 10:13, Christian König wrote:
> Let me try to clarify a couple of things:
>
> The DMA_BUF_IOCTL_SYNC ioctl flushes and invalidates caches so that the GPU can see values written by the CPU and the CPU can see values written by the GPU. But that ioctl does *not* wait for any async GPU operation to finish.
>
> If you want to wait for async GPU operations, you either need to call the OpenGL functions to read back pixels, or do a select() (or poll, epoll, etc.) call on the DMA-buf file descriptor.

Thanks for the clarification! Just to avoid any uncertainty, are both of these things done implicitly by gbm_bo_map()/gbm_bo_unmap()?

I did test adding those steps just in case, but unfortunately did not see an improvement. My order was:

1. gbm_bo_import(GBM_BO_USE_RENDERING)
2. gbm_bo_get_fd()
3. Wait for client to request displaying the buffer
4. gbm_bo_map(GBM_BO_TRANSFER_READ)
5. select(fd+1, &fds, NULL, NULL, NULL)
6. ioctl(DMA_BUF_IOCTL_SYNC, &{ .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ })
7. pixman_blt()
8. gbm_bo_unmap()

> So if you want to do some rendering with OpenGL and then see the result in a buffer memory mapping, the correct sequence would be the following:
>
> 1. Issue OpenGL rendering commands.
> 2. Call glFlush() to make sure the hardware actually starts working on the rendering.
> 3. Call select() on the DMA-buf file descriptor to wait for the rendering to complete.
> 4. Use DMA_BUF_IOCTL_SYNC to make the rendering result CPU visible.

What I want to do is implement the X server side of DRI3 entirely on the CPU. It works for every application I've tested except gnome-shell. I would assume that steps 1 and 2 are supposed to be done by the X client, i.e. gnome-shell?

What I need to be able to do is access the result of that rendering once the X client tries to draw using that GBM-backed pixmap (e.g. using PresentPixmap).

So far we've only tested Intel GPUs, but we are setting up Nvidia and AMD GPUs at the moment. It will be interesting to see whether the issue remains on those.

Regards,
Pierre Ossman
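For illustration, here is the consumer-side path listed above written out as a minimal C sketch in the same order. The format, dimensions and stride are placeholders, error handling is omitted, and nothing here is taken from the actual Xvnc code:

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/select.h>
#include <linux/dma-buf.h>
#include <gbm.h>

static void read_back_client_buffer(struct gbm_device *gbm, int client_fd,
                                    uint32_t width, uint32_t height,
                                    uint32_t stride)
{
    /* Steps 1-2: import the client's DMA-buf and get an fd for it. */
    struct gbm_import_fd_data import = {
        .fd = client_fd, .width = width, .height = height,
        .stride = stride, .format = GBM_FORMAT_ARGB8888,
    };
    struct gbm_bo *bo = gbm_bo_import(gbm, GBM_BO_IMPORT_FD, &import,
                                      GBM_BO_USE_RENDERING);
    int fd = gbm_bo_get_fd(bo);

    /* Step 3: ...wait for the client to request displaying the buffer... */

    /* Step 4: map for reading. */
    uint32_t map_stride;
    void *map_data = NULL;
    void *pixels = gbm_bo_map(bo, 0, 0, width, height,
                              GBM_BO_TRANSFER_READ, &map_stride, &map_data);

    /* Steps 5-6: wait for pending GPU writes, then begin CPU access. */
    fd_set fds;
    FD_ZERO(&fds);
    FD_SET(fd, &fds);
    select(fd + 1, &fds, NULL, NULL, NULL);
    struct dma_buf_sync sync = { .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ };
    ioctl(fd, DMA_BUF_IOCTL_SYNC, &sync);

    /* Step 7: copy the pixels out (pixman_blt() in the real code). */
    (void)pixels;

    /* Step 8: end CPU access and unmap. */
    sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ;
    ioctl(fd, DMA_BUF_IOCTL_SYNC, &sync);
    gbm_bo_unmap(bo, map_data);
}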
Re: Does gbm_bo_map() implicitly synchronise?
On 17.06.24 12:29, Pierre Ossman wrote:
> On 17/06/2024 10:13, Christian König wrote:
>> Let me try to clarify a couple of things:
>>
>> The DMA_BUF_IOCTL_SYNC ioctl flushes and invalidates caches so that the GPU can see values written by the CPU and the CPU can see values written by the GPU. But that ioctl does *not* wait for any async GPU operation to finish.
>>
>> If you want to wait for async GPU operations, you either need to call the OpenGL functions to read back pixels, or do a select() (or poll, epoll, etc.) call on the DMA-buf file descriptor.
>
> Thanks for the clarification! Just to avoid any uncertainty, are both of these things done implicitly by gbm_bo_map()/gbm_bo_unmap()?

gbm_bo_map() is *not* doing any synchronization whatsoever as far as I know. It just does the steps necessary for the mmap().

> I did test adding those steps just in case, but unfortunately did not see an improvement. My order was:
>
> 1. gbm_bo_import(GBM_BO_USE_RENDERING)
> 2. gbm_bo_get_fd()
> 3. Wait for client to request displaying the buffer
> 4. gbm_bo_map(GBM_BO_TRANSFER_READ)
> 5. select(fd+1, &fds, NULL, NULL, NULL)
> 6. ioctl(DMA_BUF_IOCTL_SYNC, &{ .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ })
> 7. pixman_blt()
> 8. gbm_bo_unmap()

At least off hand that looks like it should work.

> What I want to do is implement the X server side of DRI3 entirely on the CPU. It works for every application I've tested except gnome-shell. I would assume that steps 1 and 2 are supposed to be done by the X client, i.e. gnome-shell?

Yes, exactly that.

> What I need to be able to do is access the result of that rendering once the X client tries to draw using that GBM-backed pixmap (e.g. using PresentPixmap).

No idea why that doesn't work.

Regards,
Christian.

> So far we've only tested Intel GPUs, but we are setting up Nvidia and AMD GPUs at the moment. It will be interesting to see whether the issue remains on those.
Re: Does gbm_bo_map() implicitly synchronise?
On 17.06.24 16:50, Michel Dänzer wrote:
> On 2024-06-17 12:29, Pierre Ossman wrote:
>> Just to avoid any uncertainty, are both of these things done implicitly by gbm_bo_map()/gbm_bo_unmap()?
>>
>> I did test adding those steps just in case, but unfortunately did not see an improvement. My order was:
>>
>> 1. gbm_bo_import(GBM_BO_USE_RENDERING)
>> 2. gbm_bo_get_fd()
>> 3. Wait for client to request displaying the buffer
>> 4. gbm_bo_map(GBM_BO_TRANSFER_READ)
>> 5. select(fd+1, &fds, NULL, NULL, NULL)
>
> *If* select() is needed, it needs to be before gbm_bo_map(), because the latter may perform a blit from the real BO to a staging one for CPU access.

But don't you then need to wait for the blit to finish?

Regards,
Christian.

>> 6. ioctl(DMA_BUF_IOCTL_SYNC, &{ .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ })
>
> gbm_bo_map() should do this internally if needed.
>
>> 7. pixman_blt()
>> 8. gbm_bo_unmap()
Re: Does gbm_bo_map() implicitly synchronise?
On 2024-06-17 12:29, Pierre Ossman wrote:
>
> Just to avoid any uncertainty, are both of these things done implicitly by gbm_bo_map()/gbm_bo_unmap()?
>
> I did test adding those steps just in case, but unfortunately did not see an improvement. My order was:
>
> 1. gbm_bo_import(GBM_BO_USE_RENDERING)
> 2. gbm_bo_get_fd()
> 3. Wait for client to request displaying the buffer
> 4. gbm_bo_map(GBM_BO_TRANSFER_READ)
> 5. select(fd+1, &fds, NULL, NULL, NULL)

*If* select() is needed, it needs to be before gbm_bo_map(), because the latter may perform a blit from the real BO to a staging one for CPU access.

> 6. ioctl(DMA_BUF_IOCTL_SYNC, &{ .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ })

gbm_bo_map() should do this internally if needed.

> 7. pixman_blt()
> 8. gbm_bo_unmap()
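For illustration, the middle of the earlier read-path sketch with the reordering suggested above: the select() moves before gbm_bo_map(), and the explicit DMA_BUF_IOCTL_SYNC is dropped on the assumption that gbm_bo_map() performs it (and any staging blit) internally when needed. The bo, dmabuf_fd, width and height variables are the same placeholders as in the earlier sketch:

/* Wait for the client's GPU rendering first... */
fd_set fds;
FD_ZERO(&fds);
FD_SET(dmabuf_fd, &fds);
select(dmabuf_fd + 1, &fds, NULL, NULL, NULL);

/* ...then map; gbm_bo_map() is expected to take care of any staging blit
 * and cache maintenance itself. */
uint32_t map_stride;
void *map_data = NULL;
void *pixels = gbm_bo_map(bo, 0, 0, width, height,
                          GBM_BO_TRANSFER_READ, &map_stride, &map_data);
/* ... copy pixels ... */
gbm_bo_unmap(bo, map_data);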
Re: Does gbm_bo_map() implicitly synchronise?
On 17.06.24 16:55, Michel Dänzer wrote:
> On 2024-06-17 16:52, Christian König wrote:
>> On 17.06.24 16:50, Michel Dänzer wrote:
>>> On 2024-06-17 12:29, Pierre Ossman wrote:
>>>> Just to avoid any uncertainty, are both of these things done implicitly by gbm_bo_map()/gbm_bo_unmap()?
>>>>
>>>> I did test adding those steps just in case, but unfortunately did not see an improvement. My order was:
>>>>
>>>> 1. gbm_bo_import(GBM_BO_USE_RENDERING)
>>>> 2. gbm_bo_get_fd()
>>>> 3. Wait for client to request displaying the buffer
>>>> 4. gbm_bo_map(GBM_BO_TRANSFER_READ)
>>>> 5. select(fd+1, &fds, NULL, NULL, NULL)
>>>
>>> *If* select() is needed, it needs to be before gbm_bo_map(), because the latter may perform a blit from the real BO to a staging one for CPU access.
>>
>> But don't you then need to wait for the blit to finish?
>
> No, gbm_bo_map() must handle that internally. When it returns, the CPU must see the correct contents.

Ah, OK, in that case that function does more than I expected.

Thanks,
Christian.

>>>> 6. ioctl(DMA_BUF_IOCTL_SYNC, &{ .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ })
>>>
>>> gbm_bo_map() should do this internally if needed.
>>>
>>>> 7. pixman_blt()
>>>> 8. gbm_bo_unmap()
Re: Does gbm_bo_map() implicitly synchronise?
On 2024-06-17 16:52, Christian König wrote:
> On 17.06.24 16:50, Michel Dänzer wrote:
>> On 2024-06-17 12:29, Pierre Ossman wrote:
>>> Just to avoid any uncertainty, are both of these things done implicitly by gbm_bo_map()/gbm_bo_unmap()?
>>>
>>> I did test adding those steps just in case, but unfortunately did not see an improvement. My order was:
>>>
>>> 1. gbm_bo_import(GBM_BO_USE_RENDERING)
>>> 2. gbm_bo_get_fd()
>>> 3. Wait for client to request displaying the buffer
>>> 4. gbm_bo_map(GBM_BO_TRANSFER_READ)
>>> 5. select(fd+1, &fds, NULL, NULL, NULL)
>>
>> *If* select() is needed, it needs to be before gbm_bo_map(), because the latter may perform a blit from the real BO to a staging one for CPU access.
>
> But don't you then need to wait for the blit to finish?

No, gbm_bo_map() must handle that internally. When it returns, the CPU must see the correct contents.

>>> 6. ioctl(DMA_BUF_IOCTL_SYNC, &{ .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ })
>>
>> gbm_bo_map() should do this internally if needed.
>>
>>> 7. pixman_blt()
>>> 8. gbm_bo_unmap()
Re: Does gbm_bo_map() implicitly synchronise?
On 17/06/2024 16:50, Michel Dänzer wrote:
> On 2024-06-17 12:29, Pierre Ossman wrote:
>> Just to avoid any uncertainty, are both of these things done implicitly by gbm_bo_map()/gbm_bo_unmap()?
>>
>> I did test adding those steps just in case, but unfortunately did not see an improvement. My order was:
>>
>> 1. gbm_bo_import(GBM_BO_USE_RENDERING)
>> 2. gbm_bo_get_fd()
>> 3. Wait for client to request displaying the buffer
>> 4. gbm_bo_map(GBM_BO_TRANSFER_READ)
>> 5. select(fd+1, &fds, NULL, NULL, NULL)
>
> *If* select() is needed, it needs to be before gbm_bo_map(), because the latter may perform a blit from the real BO to a staging one for CPU access.

Can I know whether it is needed or not? Or should I be cautious and always do it?

I also assumed I should do select() with readfds set when I want to read, and writefds set when I want to write?

Still, after moving it before the map, the issue unfortunately remains. :/

A recording of the issue is available here, in case the behaviour rings a bell for anyone:

http://www.cendio.com/~ossman/dri3/Screencast%20from%202024-06-17%2017-06-50.webm

(I tried to include it as an attachment, but that email was filtered out somewhere.)

Regards,
Pierre Ossman
Re: Does gbm_bo_map() implicitly synchronise?
On 2024-06-17 17:27, Pierre Ossman wrote:
> On 17/06/2024 16:50, Michel Dänzer wrote:
>> On 2024-06-17 12:29, Pierre Ossman wrote:
>>> Just to avoid any uncertainty, are both of these things done implicitly by gbm_bo_map()/gbm_bo_unmap()?
>>>
>>> I did test adding those steps just in case, but unfortunately did not see an improvement. My order was:
>>>
>>> 1. gbm_bo_import(GBM_BO_USE_RENDERING)
>>> 2. gbm_bo_get_fd()
>>> 3. Wait for client to request displaying the buffer
>>> 4. gbm_bo_map(GBM_BO_TRANSFER_READ)
>>> 5. select(fd+1, &fds, NULL, NULL, NULL)
>>
>> *If* select() is needed, it needs to be before gbm_bo_map(), because the latter may perform a blit from the real BO to a staging one for CPU access.
>
> Can I know whether it is needed or not? Or should I be cautious and always do it?

Assuming GBM in the X server uses the GPU HW driver, I'd say it shouldn't be needed.

> A recording of the issue is available here, in case the behaviour rings a bell for anyone:
>
> http://www.cendio.com/~ossman/dri3/Screencast%20from%202024-06-17%2017-06-50.webm

Interesting. Looks like the surroundings (drop shadow region?) of the window move along with it first, then the surroundings get fixed up in the next frame.

As far as I know, mutter doesn't move window contents like that on the client side; it always redraws the damaged output region from scratch. So I wonder if the initial move together with the surroundings is actually a blit on the X server side (possibly triggered by mutter moving the X window in its function as window manager). And then the surroundings fixing themselves up is the correct output from mutter via DRI3/Present.

If so, the issue isn't synchronization, it's that the first blit happens at all.
Re: Does gbm_bo_map() implicitly synchronise?
On 17/06/2024 18:09, Michel Dänzer wrote:
>> Can I know whether it is needed or not? Or should I be cautious and always do it?
>
> Assuming GBM in the X server uses the GPU HW driver, I'd say it shouldn't be needed.

It does not (except for the driver that libgbm loads). We're trying to use this in Xvnc, so it's all CPU. We're just trying to make sure applications can use the full power of the GPU to render their stuff before handing it over to the X server. :)

>> A recording of the issue is available here, in case the behaviour rings a bell for anyone:
>>
>> http://www.cendio.com/~ossman/dri3/Screencast%20from%202024-06-17%2017-06-50.webm
>
> Interesting. Looks like the surroundings (drop shadow region?) of the window move along with it first, then the surroundings get fixed up in the next frame.
>
> As far as I know, mutter doesn't move window contents like that on the client side; it always redraws the damaged output region from scratch. So I wonder if the initial move together with the surroundings is actually a blit on the X server side (possibly triggered by mutter moving the X window in its function as window manager). And then the surroundings fixing themselves up is the correct output from mutter via DRI3/Present.
>
> If so, the issue isn't synchronization, it's that the first blit happens at all.

Hmm... The source of the blit is CopyWindow being called as a result of the window moving. But I would have expected that to be inhibited by the fact that a compositor is active.

It's also surprising that this only happens when DRI3 is involved. I would have expected something similar with software rendering as well, albeit with a PutImage instead of a PresentPixmap for the correct data. But everything works there.

I will need to dig further.

Regards,
Pierre Ossman
Re: Does gbm_bo_map() implicitly synchronise?
On 17.06.24 19:18, Pierre Ossman wrote:
> On 17/06/2024 18:09, Michel Dänzer wrote:
>>> Can I know whether it is needed or not? Or should I be cautious and always do it?
>>
>> Assuming GBM in the X server uses the GPU HW driver, I'd say it shouldn't be needed.
>
> It does not (except for the driver that libgbm loads). We're trying to use this in Xvnc, so it's all CPU. We're just trying to make sure applications can use the full power of the GPU to render their stuff before handing it over to the X server. :)

That whole approach won't work. If you don't have a HW driver loaded, or at least somehow tell the client that it should render into a linear buffer, then the data in the buffer will be tiled in a hw-specific format. As far as I know you can't read that in a vendor-agnostic way with the CPU; you need the hw driver for that.

Regards,
Christian.
Re: Does gbm_bo_map() implicitly synchronise?
On 17/06/2024 20:18, Christian König wrote:
> On 17.06.24 19:18, Pierre Ossman wrote:
>> On 17/06/2024 18:09, Michel Dänzer wrote:
>>>> Can I know whether it is needed or not? Or should I be cautious and always do it?
>>>
>>> Assuming GBM in the X server uses the GPU HW driver, I'd say it shouldn't be needed.
>>
>> It does not (except for the driver that libgbm loads). We're trying to use this in Xvnc, so it's all CPU. We're just trying to make sure applications can use the full power of the GPU to render their stuff before handing it over to the X server. :)
>
> That whole approach won't work. If you don't have a HW driver loaded, or at least somehow tell the client that it should render into a linear buffer, then the data in the buffer will be tiled in a hw-specific format. As far as I know you can't read that in a vendor-agnostic way with the CPU; you need the hw driver for that.

I'm confused. What's the goal of the GBM abstraction, and specifically gbm_bo_map(), if it's not a hardware-agnostic way of accessing buffers?

In practice, we are getting linear buffers, at least on Intel and AMD GPUs. Nvidia are being a bit difficult about getting GBM working, so we haven't tested that yet.

I see there is the GBM_BO_USE_LINEAR flag. We have not used it yet, as we haven't seen a need for it. What is the effect of that flag? Would it guarantee what we are currently just lucky to be seeing?

Regards,
Pierre Ossman
Re: Does gbm_bo_map() implicitly synchronise?
On 18.06.24 07:01, Pierre Ossman wrote:
> On 17/06/2024 20:18, Christian König wrote:
>> That whole approach won't work. If you don't have a HW driver loaded, or at least somehow tell the client that it should render into a linear buffer, then the data in the buffer will be tiled in a hw-specific format. As far as I know you can't read that in a vendor-agnostic way with the CPU; you need the hw driver for that.
>
> I'm confused. What's the goal of the GBM abstraction, and specifically gbm_bo_map(), if it's not a hardware-agnostic way of accessing buffers?

There is no hardware-agnostic way of accessing buffers which contain hw-specific data. You always need a hw-specific backend for that, or you use the linear flag, which makes the data hw-agnostic.

> In practice, we are getting linear buffers, at least on Intel and AMD GPUs. Nvidia are being a bit difficult about getting GBM working, so we haven't tested that yet.

That's either because you happen to have a linear buffer for some reason, or because the hardware-specific gbm backend has inserted a blit, as Michel described.

> I see there is the GBM_BO_USE_LINEAR flag. We have not used it yet, as we haven't seen a need for it. What is the effect of that flag? Would it guarantee what we are currently just lucky to be seeing?

Michel and/or Marek need to answer that. I'm coming from the kernel side and maintain the DMA-buf implementation backing all this, but I'm not an expert on gbm.

Regards,
Christian.
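For illustration, a minimal C sketch of what using GBM_BO_USE_LINEAR could look like on the allocating side, plus a modifier check the receiving side could use. This is an assumption about how the flag would be applied, not something tested in this thread; gbm_dev, width and height are placeholders:

#include <stdint.h>
#include <gbm.h>
#include <drm_fourcc.h>

static struct gbm_bo *create_linear_bo(struct gbm_device *gbm_dev,
                                       uint32_t width, uint32_t height)
{
    /* GBM_BO_USE_LINEAR requests linear placement, so the CPU can read the
     * pixels back without knowing the vendor's tiling layout. */
    struct gbm_bo *bo = gbm_bo_create(gbm_dev, width, height,
                                      GBM_FORMAT_XRGB8888,
                                      GBM_BO_USE_RENDERING | GBM_BO_USE_LINEAR);

    /* A modifier of DRM_FORMAT_MOD_LINEAR confirms the buffer really is
     * linear, rather than the consumer just being lucky on this driver. */
    if (bo && gbm_bo_get_modifier(bo) != DRM_FORMAT_MOD_LINEAR) {
        /* Not linear after all; a detiling path would be needed. */
    }
    return bo;
}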