Ah, yes, I talked about that topic with Michel just last week at XDC.

It would make sense to have a generic interface to query the errors so that the 
display manager/compositor can do something reasonable when an application 
messes up its rendering.

E.g. display an error message instead of just a window full of random pixels.

On 09.10.25 22:35, Dave Airlie wrote:
> Just adding Christian and Faith, who might have some more comments.
> 
> On Fri, 10 Oct 2025 at 06:04, Zack Rusin <[email protected]> wrote:
>>
>> Propagate the fence errors from drivers to userspace. Allows userspace to
>> react to asynchronous errors coming from the drivers.
>>
>> One of the trickiest bits of drm syncobj interface is that, unexpectedly,
>> the syncobj doesn't propagate the fence errors on wait. Whenever something
>> goes wrong in an asynchronous task/job that uses drm syncobj to
>> communicate with the userspace there's no way to convey that issue
>> with userspace as drm syncobj wait function will only check whether
>> a fence has been signaled but not whether it has been signaled without
>> error.
>>
>> Instead of assuming that a signaled fence implies success grab the actual
>> status of the fence and return the first fence error that has been
>> spotted. Return the first error because all the subsequent errors are
>> likely to be caused by the initial error in a chain of tasks.
>>
>> [RFC]: Some drivers (e.g. Xe) do accept drm syncobj's in the vm_bind
>> and exec interface, they also call dma_fence_set_error when those
>> operations asynchronously fail, currently those errors will just be
>> silently ignored (because they don't propagate), I'm not sure how the
>> userspace written for those drivers will react to actually receiving
>> those errors, even if silently dropping those driver errors seems
>> completely wrong to me.

IIRC during the initial drm_syncobj or timeline bringup we had a brief 
discussion about whether we should do this on wait, and then decided against it.

The wait functionality of both sync_file and DMA-buf file descriptors 
doesn't bubble up the error either.

Instead, the sync_file has a SYNC_IOC_FILE_INFO IOCTL to query the result of 
the operation separately after the wait has completed.
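
As a rough illustration of that two-step pattern, here is a minimal userspace sketch that queries a sync_file fd after the wait. The helper name sync_file_query_status is made up for this example. It assumes the uapi header <linux/sync_file.h> is available and that a request with num_fences == 0 only asks for the overall status:

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/sync_file.h>

/*
 * Hypothetical helper, not an existing API.
 *
 * Returns 1 if the sync_file has signaled without error, 0 if it is still
 * active, or a negative errno value. A negative value is either the fence
 * error reported by the kernel or the failure of the ioctl itself; this
 * sketch does not tell the two apart.
 */
static int sync_file_query_status(int fd)
{
	struct sync_file_info info;

	memset(&info, 0, sizeof(info));

	/* num_fences == 0: ask only for the overall status, no per-fence array */
	if (ioctl(fd, SYNC_IOC_FILE_INFO, &info) < 0)
		return -errno;

	return info.status;
}
```

A caller would first poll() the fd until it becomes readable and only then call the helper, so the wait itself never has to carry the error.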

Amdgpu, Nouveau and i915 have functions to do this in driver-specific ways.

I think we should just add a DRM_IOCTL_SYNCOBJ_ERRNO IOCTL (feel free to come 
up with a better name) to query the potential error from a timeline sync point 
after the wait for it has completed.

One problem could be that fences with errors are garbage collected on a 
timeline before we have a chance to return the error code to userspace, but in 
this case I think we can just propagate the error through the timeline.
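
One possible reading of "propagate the error through the timeline", sketched as a plain userspace model rather than kernel code. Everything here (struct model_point, model_timeline_gc) is hypothetical and only illustrates the idea that an error survives garbage collection by being handed to the oldest point that remains:

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of one timeline point; not a kernel structure. */
struct model_point {
	unsigned long long seqno;
	int signaled;	/* non-zero once the point has signaled */
	int error;	/* 0 or a negative errno value */
};

/*
 * Drop signaled points from the head of the timeline, always keeping the
 * newest point. If a dropped point carried an error, hand the first such
 * error to the oldest surviving point, unless that point already has an
 * error of its own. Returns the new number of points.
 */
static size_t model_timeline_gc(struct model_point *pts, size_t n)
{
	size_t head = 0, i;
	int carried = 0;

	while (head + 1 < n && pts[head].signaled) {
		if (!carried)
			carried = pts[head].error;
		head++;
	}

	/* Compact the array so the surviving points start at index 0 */
	for (i = head; i < n; i++)
		pts[i - head] = pts[i];
	n -= head;

	if (n > 0 && carried && !pts[0].error)
		pts[0].error = carried;

	return n;
}
```

In this model a later query on the surviving point still sees the error of a point that has already been collected. Whether the real implementation should behave this way is exactly the open question.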

Regards,
Christian.

>>
>> Signed-off-by: Zack Rusin <[email protected]>
>> Cc: [email protected]
>> Cc: David Airlie <[email protected]>
>> Cc: Simona Vetter <[email protected]>
>> Cc: Maarten Lankhorst <[email protected]>
>> Cc: Maxime Ripard <[email protected]>
>> Cc: Thomas Zimmermann <[email protected]>
>> ---
>>  drivers/gpu/drm/drm_syncobj.c | 12 ++++++++++++
>>  1 file changed, 12 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/drm_syncobj.c b/drivers/gpu/drm/drm_syncobj.c
>> index e1b0fa4000cd..bcd8eff8b59a 100644
>> --- a/drivers/gpu/drm/drm_syncobj.c
>> +++ b/drivers/gpu/drm/drm_syncobj.c
>> @@ -1067,6 +1067,7 @@ static signed long drm_syncobj_array_wait_timeout(struct drm_syncobj **syncobjs,
>>         struct dma_fence *fence;
>>         uint64_t *points;
>>         uint32_t signaled_count, i;
>> +       int fence_status, first_fence_error = 0;
>>
>>         if (flags & (DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT |
>>                      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE)) {
>> @@ -1170,6 +1171,9 @@ static signed long drm_syncobj_array_wait_timeout(struct drm_syncobj **syncobjs,
>>                              dma_fence_add_callback(fence,
>>                                                     &entries[i].fence_cb,
>>                                                     syncobj_wait_fence_func))) {
>> +                               fence_status = dma_fence_get_status(fence);
>> +                               if (fence_status < 0 && !first_fence_error)
>> +                                       first_fence_error = fence_status;
>>                                 /* The fence has been signaled */
>>                                 if (flags & DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL) {
>>                                         signaled_count++;
>> @@ -1213,6 +1217,14 @@ static signed long drm_syncobj_array_wait_timeout(struct drm_syncobj **syncobjs,
>>  err_free_points:
>>         kfree(points);
>>
>> +       /*
>> +        * Propagate the first fence error the code has seen, but
>> +        * give precedence to the overall wait error in case one
>> +        * was encountered.
>> +        */
>> +       if (first_fence_error < 0 && timeout >= 0)
>> +               timeout = first_fence_error;
>> +
>>         return timeout;
>>  }
>>
>> --
>> 2.48.1
>>
