On Thu, 8 Jun 2023 08:55:51 +0200
Boris Brezillon <[email protected]> wrote:
> If I understand correctly, drm_sched_entity_kill_jobs_cb() is supposed
> to wait on all the external dependencies (those added to
> drm_sched_job::dependencies) before signaling the job finished fence.
> This is done this way to prevent jobs depending on these canceled jobs
> from considering the resources they want to access as ready, when
> they're actually still used by other components, thus leading to
> potential data corruptions.
>
> The problem is, the kill_jobs logic is omitting the last fence popped
> from the dependencies array that was waited upon before
> drm_sched_entity_kill() was called (drm_sched_entity::dependency field),
> so we're basically waiting for all dependencies except one.
>
> This is an attempt at fixing that, but also an opportunity to make sure
> I understood the drm_sched_entity_kill(), because I'm not 100% sure if
^ the drm_sched_entity_kill() logic correctly, ...
> skipping this currently popped dependency was intentional or not. I can't
> see a good reason why we'd want to skip that one, but I might be missing
> something.
>
> Signed-off-by: Boris Brezillon <[email protected]>
> Cc: Frank Binns <[email protected]>
> Cc: Sarah Walker <[email protected]>
> Cc: Donald Robson <[email protected]>
> Cc: Luben Tuikov <[email protected]>
> Cc: David Airlie <[email protected]>
> Cc: Daniel Vetter <[email protected]>
> Cc: Sumit Semwal <[email protected]>
> Cc: "Christian König" <[email protected]>
> ---
> Stumbled across this issue while working on native dependency support
> with Donald (which will be posted soon). Flagged as RFC, because I'm
> not sure this is legit, and also not sure we want to fix it this way.
> I tried re-using drm_sched_entity::dependency, but it's a bit of a mess
> because of the asynchronousity of the wait, and the fact we use
> drm_sched_entity::dependency to know if we have a clear_dep()
> callback registered, so we can easily reset it without removing the
^ we can't ...
> callback.