Ah, yes, I talked about that topic with Michel just last week at XDC. It would make sense to have a generic interface to query the errors so that the display manager/compositor can do something reasonable when an application messes up its rendering.
E.g. display an error message instead of just a window full of random pixels.

On 09.10.25 22:35, Dave Airlie wrote:
> Just adding Christian and Faith, who might have some more comments.
>
> On Fri, 10 Oct 2025 at 06:04, Zack Rusin <[email protected]> wrote:
>>
>> Propagate the fence errors from drivers to userspace. Allows userspace to
>> react to asynchronous errors coming from the drivers.
>>
>> One of the trickiest bits of drm syncobj interface is that, unexpectedly,
>> the syncobj doesn't propagate the fence errors on wait. Whenever something
>> goes wrong in an asynchronous task/job that uses drm syncobj to
>> communicate with the userspace there's no way to convey that issue
>> with userspace as drm syncobj wait function will only check whether
>> a fence has been signaled but not whether it has been signaled without
>> error.
>>
>> Instead of assuming that a signaled fence implies success grab the actual
>> status of the fence and return the first fence error that has been
>> spotted. Return the first error because all the subsequent errors are
>> likely to be caused by the initial error in a chain of tasks.
>>
>> [RFC]: Some drivers (e.g. Xe) do accept drm syncobj's in the vm_bind
>> and exec interface, they also call dma_fence_set_error when those
>> operations asynchronously fail, currently those errors will just be
>> silently ignored (because they don't propagate), I'm not sure how the
>> userspace written for those drivers will react to actually receiving
>> those errors, even if silently dropping those driver errors seems
>> completely wrong to me.

IIRC during the initial drm_syncobj or timeline bringup we had a brief discussion about whether we should do this on wait and then decided against it.

The wait functionality in both sync_file and the DMA-buf file descriptor doesn't bubble up the error on wait either.

Instead, sync_file has a SYNC_IOC_FILE_INFO IOCTL to query the result of the operation separately after the wait has completed.
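
For reference, a minimal userspace sketch of that separate query on a sync_file. It assumes the fd has already been waited on (e.g. with poll()); the helper name sync_file_query_status is made up for illustration, while SYNC_IOC_FILE_INFO and struct sync_file_info come from <linux/sync_file.h>.

```c
#include <errno.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/sync_file.h>

/*
 * Hypothetical helper: query the status of a sync_file fd.
 *
 * Returns 1 if the fence signaled without error, 0 if it is still
 * active, or a negative errno. Note that a negative value can be either
 * the fence error itself or a failure of the ioctl; this sketch does not
 * try to tell the two apart.
 */
static int sync_file_query_status(int fd)
{
	struct sync_file_info info;

	memset(&info, 0, sizeof(info));

	/* num_fences == 0: ask only for the summary fields, incl. status */
	if (ioctl(fd, SYNC_IOC_FILE_INFO, &info) < 0)
		return -errno;

	return info.status;
}
```

A caller would poll() the fd first and then check the return value for something below zero, so the wait itself never has to carry the error.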
Amdgpu, Nouveau and i915 have functions to do this in driver-specific ways.

I think we should just add a DRM_IOCTL_SYNCOBJ_ERRNO IOCTL (feel free to come up with a better name) to query the potential error from a timeline sync point after the wait for it has completed.

One problem could be that fences with errors are garbage collected on a timeline before we have a chance to return the error code to userspace, but in this case I think we can just propagate the error through the timeline.

Regards,
Christian.

>>
>> Signed-off-by: Zack Rusin <[email protected]>
>> Cc: [email protected]
>> Cc: David Airlie <[email protected]>
>> Cc: Simona Vetter <[email protected]>
>> Cc: Maarten Lankhorst <[email protected]>
>> Cc: Maxime Ripard <[email protected]>
>> Cc: Thomas Zimmermann <[email protected]>
>> ---
>>  drivers/gpu/drm/drm_syncobj.c | 12 ++++++++++++
>>  1 file changed, 12 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/drm_syncobj.c b/drivers/gpu/drm/drm_syncobj.c
>> index e1b0fa4000cd..bcd8eff8b59a 100644
>> --- a/drivers/gpu/drm/drm_syncobj.c
>> +++ b/drivers/gpu/drm/drm_syncobj.c
>> @@ -1067,6 +1067,7 @@ static signed long
>> drm_syncobj_array_wait_timeout(struct drm_syncobj **syncobjs,
>>         struct dma_fence *fence;
>>         uint64_t *points;
>>         uint32_t signaled_count, i;
>> +       int fence_status, first_fence_error = 0;
>>
>>         if (flags & (DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT |
>>                      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE)) {
>> @@ -1170,6 +1171,9 @@ static signed long
>> drm_syncobj_array_wait_timeout(struct drm_syncobj **syncobjs,
>>                             dma_fence_add_callback(fence,
>>                                                    &entries[i].fence_cb,
>>
>> syncobj_wait_fence_func))) {
>> +                               fence_status = dma_fence_get_status(fence);
>> +                               if (fence_status < 0 && !first_fence_error)
>> +                                       first_fence_error = fence_status;
>>                                 /* The fence has been signaled */
>>                                 if (flags & DRM_SYNCOBJ_WAIT_FLAGS_WAIT_ALL)
>> {
>>                                         signaled_count++;
>> @@ -1213,6 +1217,14 @@ static signed long
>> drm_syncobj_array_wait_timeout(struct drm_syncobj **syncobjs,
>> err_free_points:
>>         kfree(points);
>>
>> +       /*
>> +        * Propagate the last fence error the code has seen, but
>> +        * give precedence to the overall wait error in case one
>> +        * was encountered.
>> +        */
>> +       if (first_fence_error < 0 && timeout >= 0)
>> +               timeout = first_fence_error;
>> +
>>         return timeout;
>>  }
>>
>> --
>> 2.48.1
>>
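
To make the selection rule in the quoted patch easier to reason about, here is a standalone model of it in plain C. This is only a sketch and not kernel code: wait_result and the statuses array are hypothetical, with each entry standing in for what dma_fence_get_status() would report for a signaled fence (1 for success, negative for an error). Note that the code keeps the first error it sees, as the commit message says, while the comment in the last hunk says "last".

```c
#include <stddef.h>

/*
 * Hypothetical model of the patch's error selection.
 *
 * statuses: one entry per signaled fence (1 = OK, < 0 = fence error)
 * timeout:  the value the wait would return without the patch
 *           (>= 0 on success, negative errno if the wait itself failed)
 */
static long wait_result(const int *statuses, size_t count, long timeout)
{
	int first_fence_error = 0;
	size_t i;

	/* Remember only the first error; later ones are likely fallout. */
	for (i = 0; i < count; i++) {
		if (statuses[i] < 0 && !first_fence_error)
			first_fence_error = statuses[i];
	}

	/* An error from the wait itself takes precedence over fence errors. */
	if (first_fence_error < 0 && timeout >= 0)
		timeout = first_fence_error;

	return timeout;
}
```

With this rule a successful wait on fences that failed no longer returns the remaining timeout, which is exactly the behavior change existing userspace would see.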
