Re: Does gbm_bo_map() implicitly synchronise?

2024-06-17 Thread Pierre Ossman

On 15/06/2024 13:35, Marek Olšák wrote:

It's probably driver-specific. Some drivers might need glFlush before
you use gbm_bo_map because gbm might only wait for work that has been
flushed.



That would be needed on the "writing" side, right? So if I'm seeing 
issues when mapping for reading, then it would indicate a bug in the 
other peer? Which would be gnome-shell in my case.


Any way I could test this? Can I force extra syncs/flushes in some way 
and see if the issue goes away?


I tried adding a sleep of 10ms before reading the data, but did not see 
any improvement. Which would make sense if the commands are still 
sitting in an application buffer somewhere, rather than with the GPU.


Regards
--
Pierre Ossman           Software Development
Cendio AB               https://cendio.com
Teknikringen 8          https://twitter.com/ThinLinc
583 30 Linköping        https://facebook.com/ThinLinc
Phone: +46-13-214600

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?



Re: Does gbm_bo_map() implicitly synchronise?

2024-06-17 Thread Christian König

On 17.06.24 at 09:32, Pierre Ossman wrote:

On 15/06/2024 13:35, Marek Olšák wrote:

It's probably driver-specific. Some drivers might need glFlush before
you use gbm_bo_map because gbm might only wait for work that has been
flushed.



That would be needed on the "writing" side, right? So if I'm seeing 
issues when mapping for reading, then it would indicate a bug in the 
other peer? Which would be gnome-shell in my case.


Any way I could test this? Can I force extra syncs/flushes in some way 
and see if the issue goes away?


Well, the primary question here is: what do you want to wait for?

As Marek wrote, GBM and the kernel can only see work which has been 
flushed, not work which is still queued up inside the OpenGL library, 
for example.


I tried adding a sleep of 10ms before reading the data, but did not 
see any improvement. Which would make sense if the commands are still 
sitting in an application buffer somewhere, rather than with the GPU.


Let me try to clarify a couple of things:

The DMA_BUF_IOCTL_SYNC function is to flush and invalidate caches so 
that the GPU can see values written by the CPU and the CPU can see 
values written by the GPU. But that IOCTL does *not* wait for any async 
GPU operation to finish.


If you want to wait for async GPU operations you either need to call the 
OpenGL functions to read pixels or do a select() (or poll, epoll etc...) 
call on the DMA-buf file descriptor.


So if you want to do some rendering with OpenGL and then see the result 
in a buffer memory mapping, the correct sequence would be the following:


1. Issue OpenGL rendering commands.
2. Call glFlush() to make sure the hw actually starts working on the 
rendering.
3. Call select() on the DMA-buf file descriptor to wait for the 
rendering to complete.

4. Use DMA_BUF_IOCTL_SYNC to make the rendering result CPU visible.
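For illustration, steps 3 and 4 above could be sketched in C roughly as 
follows. This is a minimal sketch, not authoritative: `dmabuf_fd` is 
assumed to be the file descriptor exported for the buffer, the glFlush() 
of step 2 is left to the GL side, and error handling is trimmed.

```c
#include <assert.h>
#include <unistd.h>
#include <sys/select.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/* Step 3: block until the fd is readable, i.e. the fences attached to
 * the dma-buf have signalled and the GPU has finished writing it. */
static int wait_readable(int fd)
{
    fd_set fds;

    FD_ZERO(&fds);
    FD_SET(fd, &fds);
    return select(fd + 1, &fds, NULL, NULL, NULL) == 1 ? 0 : -1;
}

/* Step 4: flush/invalidate caches so the CPU sees the GPU's writes.
 * Note: this does NOT wait for the GPU; that is what step 3 is for. */
static int begin_cpu_read(int dmabuf_fd)
{
    struct dma_buf_sync sync = {
        .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ,
    };

    return ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
}
```

After reading is done, the symmetric DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ 
call would close the CPU access window.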

Regards,
Christian.



Regards




Re: Does gbm_bo_map() implicitly synchronise?

2024-06-17 Thread Pierre Ossman

On 17/06/2024 10:13, Christian König wrote:


Let me try to clarify a couple of things:

The DMA_BUF_IOCTL_SYNC function is to flush and invalidate caches so 
that the GPU can see values written by the CPU and the CPU can see 
values written by the GPU. But that IOCTL does *not* wait for any async 
GPU operation to finish.


If you want to wait for async GPU operations you either need to call the 
OpenGL functions to read pixels or do a select() (or poll, epoll etc...) 
call on the DMA-buf file descriptor.




Thanks for the clarification!

Just to avoid any uncertainty, are both of these things done implicitly 
by gbm_bo_map()/gbm_bo_unmap()?


I did test adding those steps just in case, but unfortunately did not 
see an improvement. My order was:


1. gbm_bo_import(GBM_BO_USE_RENDERING)
2. gbm_bo_get_fd()
3. Wait for client to request displaying the buffer
4. gbm_bo_map(GBM_BO_TRANSFER_READ)
5. select(fd+1, &fds, NULL, NULL, NULL)
6. ioctl(DMA_BUF_IOCTL_SYNC, &{ .flags = DMA_BUF_SYNC_START | 
DMA_BUF_SYNC_READ })

7. pixman_blt()
8. gbm_bo_unmap()

So if you want to do some rendering with OpenGL and then see the result 
in a buffer memory mapping the correct sequence would be the following:


1. Issue OpenGL rendering commands.
2. Call glFlush() to make sure the hw actually starts working on the 
rendering.
3. Call select() on the DMA-buf file descriptor to wait for the 
rendering to complete.

4. Use DMA_BUF_IOCTL_SYNC to make the rendering result CPU visible.



What I want to do is implement the X server side of DRI3 in just CPU. It 
works for every application I've tested except gnome-shell.


I would assume that 1. and 2. are supposed to be done by the X client, 
i.e. gnome-shell?


What I need to be able to do is access the result of that, once the X 
client tries to draw using that GBM backed pixmap (e.g. using 
PresentPixmap).


So far, we've only tested Intel GPUs, but we are setting up Nvidia and 
AMD GPUs at the moment. It will be interesting to see if the issue 
remains on those or not.


Regards



Re: Does gbm_bo_map() implicitly synchronise?

2024-06-17 Thread Christian König

On 17.06.24 at 12:29, Pierre Ossman wrote:

On 17/06/2024 10:13, Christian König wrote:


Let me try to clarify a couple of things:

The DMA_BUF_IOCTL_SYNC function is to flush and invalidate caches so 
that the GPU can see values written by the CPU and the CPU can see 
values written by the GPU. But that IOCTL does *not* wait for any 
async GPU operation to finish.


If you want to wait for async GPU operations you either need to call 
the OpenGL functions to read pixels or do a select() (or poll, epoll 
etc...) call on the DMA-buf file descriptor.




Thanks for the clarification!

Just to avoid any uncertainty, are both of these things done 
implicitly by gbm_bo_map()/gbm_bo_unmap()?


gbm_bo_map() is *not* doing any synchronization whatsoever as far as I 
know. It just does the steps necessary for the mmap().




I did test adding those steps just in case, but unfortunately did not 
see an improvement. My order was:


1. gbm_bo_import(GBM_BO_USE_RENDERING)
2. gbm_bo_get_fd()
3. Wait for client to request displaying the buffer
4. gbm_bo_map(GBM_BO_TRANSFER_READ)
5. select(fd+1, &fds, NULL, NULL, NULL)
6. ioctl(DMA_BUF_IOCTL_SYNC, &{ .flags = DMA_BUF_SYNC_START | 
DMA_BUF_SYNC_READ })

7. pixman_blt()
8. gbm_bo_unmap()


At least offhand, that looks like it should work.



So if you want to do some rendering with OpenGL and then see the 
result in a buffer memory mapping the correct sequence would be the 
following:


1. Issue OpenGL rendering commands.
2. Call glFlush() to make sure the hw actually starts working on the 
rendering.
3. Call select() on the DMA-buf file descriptor to wait for the 
rendering to complete.

4. Use DMA_BUF_IOCTL_SYNC to make the rendering result CPU visible.



What I want to do is implement the X server side of DRI3 in just CPU. 
It works for every application I've tested except gnome-shell.


I would assume that 1. and 2. are supposed to be done by the X client, 
i.e. gnome-shell?


Yes, exactly that.



What I need to be able to do is access the result of that, once the X 
client tries to draw using that GBM backed pixmap (e.g. using 
PresentPixmap).


No idea why that doesn't work.

Regards,
Christian.



So far, we've only tested Intel GPUs, but we are setting up Nvidia and 
AMD GPUs at the moment. It will be interesting to see if the issue 
remains on those or not.


Regards




Re: Does gbm_bo_map() implicitly synchronise?

2024-06-17 Thread Christian König

On 17.06.24 at 16:50, Michel Dänzer wrote:

On 2024-06-17 12:29, Pierre Ossman wrote:

Just to avoid any uncertainty, are both of these things done implicitly by 
gbm_bo_map()/gbm_bo_unmap()?

I did test adding those steps just in case, but unfortunately did not see an 
improvement. My order was:

1. gbm_bo_import(GBM_BO_USE_RENDERING)
2. gbm_bo_get_fd()
3. Wait for client to request displaying the buffer
4. gbm_bo_map(GBM_BO_TRANSFER_READ)
5. select(fd+1, &fds, NULL, NULL, NULL)

*If* select() is needed, it needs to be before gbm_bo_map(), because the latter 
may perform a blit from the real BO to a staging one for CPU access.


But don't you then need to wait for the blit to finish?

Regards,
Christian.





6. ioctl(DMA_BUF_IOCTL_SYNC, &{ .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ 
})

gbm_bo_map() should do this internally if needed.



7. pixman_blt()
8. gbm_bo_unmap()






Re: Does gbm_bo_map() implicitly synchronise?

2024-06-17 Thread Michel Dänzer
On 2024-06-17 12:29, Pierre Ossman wrote:
>
> Just to avoid any uncertainty, are both of these things done implicitly by 
> gbm_bo_map()/gbm_bo_unmap()?
> 
> I did test adding those steps just in case, but unfortunately did not see an 
> improvement. My order was:
> 
> 1. gbm_bo_import(GBM_BO_USE_RENDERING)
> 2. gbm_bo_get_fd()
> 3. Wait for client to request displaying the buffer
> 4. gbm_bo_map(GBM_BO_TRANSFER_READ)
> 5. select(fd+1, &fds, NULL, NULL, NULL)

*If* select() is needed, it needs to be before gbm_bo_map(), because the latter 
may perform a blit from the real BO to a staging one for CPU access.


> 6. ioctl(DMA_BUF_IOCTL_SYNC, &{ .flags = DMA_BUF_SYNC_START | 
> DMA_BUF_SYNC_READ })

gbm_bo_map() should do this internally if needed.


> 7. pixman_blt()
> 8. gbm_bo_unmap()


-- 
Earthling Michel Dänzer    |    https://redhat.com
Libre software enthusiast  | Mesa and Xwayland developer



Re: Does gbm_bo_map() implicitly synchronise?

2024-06-17 Thread Christian König

On 17.06.24 at 16:55, Michel Dänzer wrote:

On 2024-06-17 16:52, Christian König wrote:

On 17.06.24 at 16:50, Michel Dänzer wrote:

On 2024-06-17 12:29, Pierre Ossman wrote:

Just to avoid any uncertainty, are both of these things done implicitly by 
gbm_bo_map()/gbm_bo_unmap()?

I did test adding those steps just in case, but unfortunately did not see an 
improvement. My order was:

1. gbm_bo_import(GBM_BO_USE_RENDERING)
2. gbm_bo_get_fd()
3. Wait for client to request displaying the buffer
4. gbm_bo_map(GBM_BO_TRANSFER_READ)
5. select(fd+1, &fds, NULL, NULL, NULL)

*If* select() is needed, it needs to be before gbm_bo_map(), because the latter 
may perform a blit from the real BO to a staging one for CPU access.

But don't you then need to wait for the blit to finish?

No, gbm_bo_map() must handle that internally. When it returns, the CPU must see 
the correct contents.


Ah, ok in that case that function does more than I expected.

Thanks,
Christian.





6. ioctl(DMA_BUF_IOCTL_SYNC, &{ .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ 
})

gbm_bo_map() should do this internally if needed.



7. pixman_blt()
8. gbm_bo_unmap()




Re: Does gbm_bo_map() implicitly synchronise?

2024-06-17 Thread Michel Dänzer
On 2024-06-17 16:52, Christian König wrote:
> On 17.06.24 at 16:50, Michel Dänzer wrote:
>> On 2024-06-17 12:29, Pierre Ossman wrote:
>>> Just to avoid any uncertainty, are both of these things done implicitly by 
>>> gbm_bo_map()/gbm_bo_unmap()?
>>>
>>> I did test adding those steps just in case, but unfortunately did not see 
>>> an improvement. My order was:
>>>
>>> 1. gbm_bo_import(GBM_BO_USE_RENDERING)
>>> 2. gbm_bo_get_fd()
>>> 3. Wait for client to request displaying the buffer
>>> 4. gbm_bo_map(GBM_BO_TRANSFER_READ)
>>> 5. select(fd+1, &fds, NULL, NULL, NULL)
>> *If* select() is needed, it needs to be before gbm_bo_map(), because the 
>> latter may perform a blit from the real BO to a staging one for CPU access.
> 
> But don't you then need to wait for the blit to finish?

No, gbm_bo_map() must handle that internally. When it returns, the CPU must see 
the correct contents.


>>> 6. ioctl(DMA_BUF_IOCTL_SYNC, &{ .flags = DMA_BUF_SYNC_START | 
>>> DMA_BUF_SYNC_READ })
>> gbm_bo_map() should do this internally if needed.
>>
>>
>>> 7. pixman_blt()
>>> 8. gbm_bo_unmap()
>>
> 




Re: Does gbm_bo_map() implicitly synchronise?

2024-06-17 Thread Pierre Ossman

On 17/06/2024 16:50, Michel Dänzer wrote:

On 2024-06-17 12:29, Pierre Ossman wrote:


Just to avoid any uncertainty, are both of these things done implicitly by 
gbm_bo_map()/gbm_bo_unmap()?

I did test adding those steps just in case, but unfortunately did not see an 
improvement. My order was:

1. gbm_bo_import(GBM_BO_USE_RENDERING)
2. gbm_bo_get_fd()
3. Wait for client to request displaying the buffer
4. gbm_bo_map(GBM_BO_TRANSFER_READ)
5. select(fd+1, &fds, NULL, NULL, NULL)


*If* select() is needed, it needs to be before gbm_bo_map(), because the latter 
may perform a blit from the real BO to a staging one for CPU access.



Can I know whether it is needed or not? Or should I be cautious and 
always do it?


I also assumed I should do select() with readfds set when I want to 
read, and writefds set when I want to write?


Still, after moving it before the map the issue unfortunately remains. :/

A recording of the issue is available here, in case the behaviour rings 
a bell for anyone:


http://www.cendio.com/~ossman/dri3/Screencast%20from%202024-06-17%2017-06-50.webm

(tried to include it as an attachment, but that email was filtered out 
somewhere)


Regards,



Re: Does gbm_bo_map() implicitly synchronise?

2024-06-17 Thread Michel Dänzer
On 2024-06-17 17:27, Pierre Ossman wrote:
> On 17/06/2024 16:50, Michel Dänzer wrote:
>> On 2024-06-17 12:29, Pierre Ossman wrote:
>>>
>>> Just to avoid any uncertainty, are both of these things done implicitly by 
>>> gbm_bo_map()/gbm_bo_unmap()?
>>>
>>> I did test adding those steps just in case, but unfortunately did not see 
>>> an improvement. My order was:
>>>
>>> 1. gbm_bo_import(GBM_BO_USE_RENDERING)
>>> 2. gbm_bo_get_fd()
>>> 3. Wait for client to request displaying the buffer
>>> 4. gbm_bo_map(GBM_BO_TRANSFER_READ)
>>> 5. select(fd+1, &fds, NULL, NULL, NULL)
>>
>> *If* select() is needed, it needs to be before gbm_bo_map(), because the 
>> latter may perform a blit from the real BO to a staging one for CPU access.
>>
> 
> Can I know whether it is needed or not? Or should I be cautious and always do 
> it?

Assuming GBM in the X server uses the GPU HW driver, I'd say it shouldn't be 
needed.


> A recording of the issue is available here, in case the behaviour rings a 
> bell for anyone:
> 
> http://www.cendio.com/~ossman/dri3/Screencast%20from%202024-06-17%2017-06-50.webm

Interesting. Looks like the surroundings (drop shadow region?) of the window 
move along with it first, then the surroundings get fixed up in the next frame.

As far as I know, mutter doesn't move window contents like that on the client 
side; it always redraws the damaged output region from scratch. So I wonder if 
the initial move together with surroundings is actually a blit on the X server 
side (possibly triggered by mutter moving the X window in its function as 
window manager). And then the surroundings fixing themselves up is the correct 
output from mutter via DRI3/Present.

If so, the issue isn't synchronization, it's that the first blit happens at all.





Re: Does gbm_bo_map() implicitly synchronise?

2024-06-17 Thread Pierre Ossman

On 17/06/2024 18:09, Michel Dänzer wrote:


Can I know whether it is needed or not? Or should I be cautious and always do 
it?


Assuming GBM in the X server uses the GPU HW driver, I'd say it shouldn't be 
needed.



It does not (except the driver libgbm loads). We're trying to use this 
in Xvnc, so it's all CPU. We're just trying to make sure the 
applications can use the full power of the GPU to render their stuff 
before handing it over to the X server. :)





A recording of the issue is available here, in case the behaviour rings a bell 
for anyone:

http://www.cendio.com/~ossman/dri3/Screencast%20from%202024-06-17%2017-06-50.webm


Interesting. Looks like the surroundings (drop shadow region?) of the window 
move along with it first, then the surroundings get fixed up in the next frame.

As far as I know, mutter doesn't move window contents like that on the client 
side; it always redraws the damaged output region from scratch. So I wonder if 
the initial move together with surroundings is actually a blit on the X server 
side (possibly triggered by mutter moving the X window in its function as 
window manager). And then the surroundings fixing themselves up is the correct 
output from mutter via DRI3/Present.

If so, the issue isn't synchronization, it's that the first blit happens at all.



Hmm... The source of the blit is CopyWindow being called as a result of 
the window moving. But I would have expected that to be inhibited by the 
fact that a compositor is active. It's also surprising that this only 
happens if DRI3 is involved.


I would also have expected something similar with software rendering. 
Albeit with a PutImage instead of PresentPixmap for the correct data. 
But everything works there.


I will need to dig further.

Regards,



Re: Does gbm_bo_map() implicitly synchronise?

2024-06-17 Thread Christian König

On 17.06.24 at 19:18, Pierre Ossman wrote:

On 17/06/2024 18:09, Michel Dänzer wrote:


Can I know whether it is needed or not? Or should I be cautious and 
always do it?


Assuming GBM in the X server uses the GPU HW driver, I'd say it 
shouldn't be needed.




It does not (except the driver libgbm loads). We're trying to use this 
in Xvnc, so it's all CPU. We're just trying to make sure the 
applications can use the full power of the GPU to render their stuff 
before handing it over to the X server. :)


That whole approach won't work.

When you don't have a HW driver loaded, or at least tell the client that 
it should render into a linear buffer somehow, the data in the buffer 
will be tiled in a hw-specific format.


As far as I know you can't read that in a vendor-agnostic way with the 
CPU; you need the hw driver for that.


Regards,
Christian.





A recording of the issue is available here, in case the behaviour 
rings a bell for anyone:


http://www.cendio.com/~ossman/dri3/Screencast%20from%202024-06-17%2017-06-50.webm 



Interesting. Looks like the surroundings (drop shadow region?) of the 
window move along with it first, then the surroundings get fixed up 
in the next frame.


As far as I know, mutter doesn't move window contents like that on 
the client side; it always redraws the damaged output region from 
scratch. So I wonder if the initial move together with surroundings 
is actually a blit on the X server side (possibly triggered by mutter 
moving the X window in its function as window manager). And then the 
surroundings fixing themselves up is the correct output from mutter 
via DRI3/Present.


If so, the issue isn't synchronization, it's that the first blit 
happens at all.




Hmm... The source of the blit is CopyWindow being called as a result 
of the window moving. But I would have expected that to be inhibited 
by the fact that a compositor is active. It's also surprising that 
this only happens if DRI3 is involved.


I would also have expected something similar with software rendering. 
Albeit with a PutImage instead of PresentPixmap for the correct data. 
But everything works there.


I will need to dig further.

Regards,




Re: Does gbm_bo_map() implicitly synchronise?

2024-06-17 Thread Pierre Ossman

On 17/06/2024 20:18, Christian König wrote:

On 17.06.24 at 19:18, Pierre Ossman wrote:

On 17/06/2024 18:09, Michel Dänzer wrote:


Can I know whether it is needed or not? Or should I be cautious and 
always do it?


Assuming GBM in the X server uses the GPU HW driver, I'd say it 
shouldn't be needed.




It does not (except the driver libgbm loads). We're trying to use this 
in Xvnc, so it's all CPU. We're just trying to make sure the 
applications can use the full power of the GPU to render their stuff 
before handing it over to the X server. :)


That whole approach won't work.

When you don't have a HW driver loaded, or at least tell the client that 
it should render into a linear buffer somehow, the data in the 
buffer will be tiled in a hw-specific format.


As far as I know you can't read that in a vendor-agnostic way with the 
CPU; you need the hw driver for that.




I'm confused. What's the goal of the GBM abstraction and specifically 
gbm_bo_map() if it's not a hardware-agnostic way of accessing buffers?


In practice, we are getting linear buffers. At least on Intel and AMD 
GPUs. Nvidia are being a bit difficult getting GBM working, so we 
haven't tested that yet.


I see there is the GBM_BO_USE_LINEAR flag. We have not used it yet, as 
we haven't seen a need for it. What is the effect of that? Would it 
guarantee what we are just lucky to see at the moment?


Regards



Re: Does gbm_bo_map() implicitly synchronise?

2024-06-17 Thread Christian König

On 18.06.24 at 07:01, Pierre Ossman wrote:

On 17/06/2024 20:18, Christian König wrote:

On 17.06.24 at 19:18, Pierre Ossman wrote:

On 17/06/2024 18:09, Michel Dänzer wrote:


Can I know whether it is needed or not? Or should I be cautious 
and always do it?


Assuming GBM in the X server uses the GPU HW driver, I'd say it 
shouldn't be needed.




It does not (except the driver libgbm loads). We're trying to use 
this in Xvnc, so it's all CPU. We're just trying to make sure the 
applications can use the full power of the GPU to render their stuff 
before handing it over to the X server. :)


That whole approach won't work.

When you don't have a HW driver loaded, or at least tell the client 
that it should render into a linear buffer somehow, the data in 
the buffer will be tiled in a hw-specific format.


As far as I know you can't read that in a vendor-agnostic way with the 
CPU; you need the hw driver for that.




I'm confused. What's the goal of the GBM abstraction and specifically 
gbm_bo_map() if it's not a hardware-agnostic way of accessing buffers?


There is no hardware-agnostic way of accessing buffers which contain 
hw-specific data.


You always need a hw-specific backend for that, or use the linear flag, 
which makes the data hw-agnostic.




In practice, we are getting linear buffers. At least on Intel and AMD 
GPUs. Nvidia are being a bit difficult getting GBM working, so we 
haven't tested that yet.


That's either because you have a linear buffer for some reason or the 
hardware specific gbm backend has inserted a blit as Michel described.


I see there is the GBM_BO_USE_LINEAR flag. We have not used it yet, as 
we haven't seen a need for it. What is the effect of that? Would it 
guarantee what we are just lucky to see at the moment?


Michel and/or Marek need to answer that. I'm coming from the kernel side 
and maintaining the DMA-buf implementation backing all this, but I'm not 
an expert on gbm.


Regards,
Christian.



Regards