On 2021-02-27 11:10 p.m., Liu, Monk wrote:
[AMD Official Use Only - Internal Distribution Only]
>>So the gfx job hangs because it has a dependency on the buggy compute job which is already hanging?
No, there is no dependency between this gfx job and that compute job
from a software perspective, but the CU is shared, so the gfx job is
affected by the bug in that compute job.
>>I am still missing something - we don't ever delete bad jobs or any
jobs until they are signaled, we reinsert the bad job back into the
mirror list in drm_sched_stop
Oh yeah, it is still kept in the mirror list; I thought it was removed
for good in the scheduler…
Then my question is: why do we need to remove it in the scheduler at all
if we always need to reinsert it back?
See the explanation in the original fix, commit 'drm/scheduler: Avoid
accessing freed bad job.' The problem with that fix was that, while it
solved the original race, it created another issue: if the driver
prematurely terminates the reset process, either because the guilty job
has already signaled (an optimization, like we have in non-amdgpu
drivers) or because of reset lock contention between multiple TDR
threads (like we have in amdgpu), then we do remove the bad job but
never reinsert it back, since drm_sched_stop is skipped. For that issue
Luben proposed a state machine approach to the entire job life cycle
handling (I can't find the patch set now), but during its review it was
decided that the better approach would be to stop relying on the job and
instead rely on the entity->finish_fence to keep all the info (what
Christian mentions at the beginning of this thread).
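
For context, the v5.11 flow looks roughly like this (paraphrased from
drivers/gpu/drm/scheduler/sched_main.c from memory, so treat it as a
sketch rather than the exact upstream code):

static void drm_sched_job_timedout(struct work_struct *work)
{
	struct drm_gpu_scheduler *sched =
		container_of(work, struct drm_gpu_scheduler, work_tdr.work);
	struct drm_sched_job *job;

	spin_lock(&sched->job_list_lock);
	job = list_first_entry_or_null(&sched->ring_mirror_list,
				       struct drm_sched_job, node);
	if (job) {
		/* Take the bad job off the list so a concurrent
		 * drm_sched_get_cleanup_job() cannot free it under us;
		 * drm_sched_stop() is expected to put it back once the
		 * scheduler thread is parked.
		 */
		list_del_init(&job->node);
		spin_unlock(&sched->job_list_lock);

		/* If the driver bails out in here before reaching
		 * drm_sched_stop() (guilty job already signaled, or reset
		 * lock contention in amdgpu), the job is never reinserted.
		 */
		job->sched->ops->timedout_job(job);
	} else {
		spin_unlock(&sched->job_list_lock);
	}
	/* ... restart the timeout timer, etc. */
}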
And even for other vendors, the better way is still to let the vendor
driver decide which is the head job.
The real issue we hit is: sometimes when we run a quark test (it hangs
the KCQ ring with a bad shader inside), the X server will occasionally
crash with a GFX ring TDR report.
The root cause is still what I described before: both the innocent gfx
job and the guilty compute job are marked as "guilty" by our driver, so
even though they are re-inserted back into the mirror list,
they are all abandoned in drm_sched_resubmit_jobs() because they have
all been processed by drm_sched_increase_karma().
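
For reference, the part of drm_sched_resubmit_jobs() that drops them
looks roughly like this (v5.11, quoted from memory, so double-check
against the real source):

void drm_sched_resubmit_jobs(struct drm_gpu_scheduler *sched)
{
	struct drm_sched_job *s_job, *tmp;
	uint64_t guilty_context;
	bool found_guilty = false;

	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
		struct drm_sched_fence *s_fence = s_job->s_fence;

		/* Any job whose karma was pushed over the hang limit by
		 * drm_sched_increase_karma() is treated as guilty here ...
		 */
		if (!found_guilty &&
		    atomic_read(&s_job->karma) > sched->hang_limit) {
			found_guilty = true;
			guilty_context = s_fence->scheduled.context;
		}

		/* ... and it (plus every later job from the same context)
		 * gets its finished fence cancelled instead of really being
		 * re-run.
		 */
		if (found_guilty &&
		    s_fence->scheduled.context == guilty_context)
			dma_fence_set_error(&s_fence->finished, -ECANCELED);

		/* ... then the job is handed back to ops->run_job() ... */
	}
}

Since both rings' head jobs had their karma bumped, each of them hits
this cancel path in its own scheduler's resubmit, which matches what we
observe.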
I see now. In this case the main issue is indeed that we cannot rely on
the head job in the mirror list being the actual bad and guilty job, and
this then requires some redesign (e.g. along the lines of what you
suggested).
Andrey
*From:* Grodzovsky, Andrey <[email protected]>
*Sent:* February 28, 2021 8:55
*To:* Liu, Monk <[email protected]>; Koenig, Christian
<[email protected]>; [email protected]
*Cc:* Zhang, Andy <[email protected]>; Chen, Horace
<[email protected]>; Zhang, Jack (Jian) <[email protected]>
*Subject:* Re: Re: [RFC] a new approach to detect which ring is the real
black sheep upon TDR reported
On 2021-02-26 10:56 p.m., Liu, Monk wrote:
[AMD Official Use Only - Internal Distribution Only]
Hi Andrey
The scenario I hit here is not the one you mentioned; let me
explain it in more detail with another, easier-to-understand
example:
Consider that you have a job1 on the KCQ, but the timeout of the KCQ
is 60 seconds (just for example).
You also have a job2 on the GFX ring, and the timeout of GFX is 2 seconds.
We submit job1 first, and assume job1 has a bug that will cause a
shader hang very soon.
After 10 seconds we submit job2; since the KCQ has 60 seconds before
reporting a TDR, the software knows nothing about the engine already
being hung.
So the gfx job hangs because it has a dependency on the buggy compute
job which is already hanging?
After 2 seconds we get a TDR report from job2 on the GFX ring;
drm_sched_job_timedout() thinks the leading job of the GFX ring is the
black sheep, so it is deleted from the mirror list.
But in fact this gfx job (job2) is innocent and we should insert it back
after recovery; because it was already deleted, this innocent job's
context/process is really harmed.
I am still missing something - we don't ever delete bad jobs or any
jobs until they are signaled, we reinsert the bad job back into the
mirror list in drm_sched_stop
(here -
https://elixir.bootlin.com/linux/v5.11.1/source/drivers/gpu/drm/scheduler/sched_main.c#L385)
after the scheduler thread is stopped, and continue with the reset procedure.
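
The reinsert the link points at is roughly this fragment (again
paraphrased from the v5.11 sources from memory):

void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
{
	kthread_park(sched->thread);

	/* Now that the scheduler thread is parked, drm_sched_get_cleanup_job()
	 * cannot race with us, so it is safe to put the bad job back at the
	 * head of the mirror list where drm_sched_job_timedout() took it from.
	 */
	if (bad && bad->sched == sched)
		list_add(&bad->node, &sched->ring_mirror_list);

	/* ... then detach/signal the already finished jobs as usual. */
}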
Andrey
Hope the above example helps
Thanks
*From:* Grodzovsky, Andrey <[email protected]>
*Sent:* February 27, 2021 0:50
*To:* Liu, Monk <[email protected]>; Koenig, Christian
<[email protected]>; [email protected]
*Cc:* Zhang, Andy <[email protected]>; Chen, Horace <[email protected]>;
Zhang, Jack (Jian) <[email protected]>
*Subject:* Re: [RFC] a new approach to detect which ring is the real
black sheep upon TDR reported
On 2021-02-26 6:54 a.m., Liu, Monk wrote:
[AMD Official Use Only - Internal Distribution Only]
See inline
Thanks
------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------
*From:* Koenig, Christian <[email protected]>
*Sent:* Friday, February 26, 2021 3:58 PM
*To:* Liu, Monk <[email protected]>; [email protected]
*Cc:* Zhang, Andy <[email protected]>; Chen, Horace <[email protected]>;
Zhang, Jack (Jian) <[email protected]>
*Subject:* Re: [RFC] a new approach to detect which ring is
the real black sheep upon TDR reported
Hi Monk,
in general an interesting idea, but I see two major problems
with that:
1. It would make the reset take much longer.
2. Things often get stuck because of timing issues, so a
guilty job might pass perfectly when run a second time.
[ML] but the innocent ring already reported a TDR, and the drm
sched logic has already removed this "sched_job" from its mirror
list, so you don't get a chance to re-submit it again after the
reset; that's the major problem here.
Just to confirm I understand correctly: Monk reports a scenario
where the second TDR, reported by the innocent job, bails out BEFORE
having a chance to run drm_sched_stop for that scheduler, which would
have reinserted the job back into the mirror list (because the first
TDR run is still in progress and hence amdgpu_device_lock_adev fails
for the second TDR), and so the innocent job which was extracted from
the mirror list in drm_sched_job_timedout is now lost.
If so, and as a possible quick fix until we overhaul the entire
design as suggested in this thread - maybe we can modify the
drm_sched_backend_ops.timedout_job callback to report back
premature termination BEFORE drm_sched_stop had a chance to run,
and then reinsert the job back into the mirror list from within
drm_sched_job_timedout? There is no problem of racing against a
concurrent drm_sched_get_cleanup_job once we reinsert there, as we
don't reference the job pointer anymore after this point, so if
it's already signaled and freed right away - it's ok.
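
A purely hypothetical sketch of what I mean - the int return below does
not exist in the current drm_sched_backend_ops, it is only there to
illustrate the idea; the rest of drm_sched_job_timedout() stays as it
is today:

	/* tail of drm_sched_job_timedout(), after the job was taken off
	 * the mirror list; assume timedout_job() is changed to return 0
	 * on a full reset and e.g. -EAGAIN when it bailed out before
	 * ever reaching drm_sched_stop().
	 */
	ret = job->sched->ops->timedout_job(job);

	if (ret == -EAGAIN) {
		/* drm_sched_stop() never ran, so reinsert the job here
		 * ourselves.  We don't dereference the job pointer after
		 * this, so a concurrent drm_sched_get_cleanup_job()
		 * freeing it once it signals is not a problem.
		 */
		spin_lock(&sched->job_list_lock);
		list_add(&job->node, &sched->ring_mirror_list);
		spin_unlock(&sched->job_list_lock);
	}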
Andrey
Apart from that, the whole ring mirror list turned out to be a
really bad idea. E.g. we still struggle with object lifetime
because the concept doesn't fit into the object model of the
GPU scheduler under Linux.
We should probably work on this separately and straighten up
the job destruction once more and keep the recovery
information in the fence instead.
[ML] we promise our customers that no innocent process will be
dropped or cancelled, and our current logic works most of the
time; but when different processes are running on the gfx/compute
rings we run into the tricky situation I described here, and the
proposal is the only way I can figure out so far. Do you have a
better solution or idea that we could review as another candidate
RFC? Note that we raised this proposal because we do hit this
trouble and we do need to resolve it …. So even an imperfect
solution is still better than just cancelling the innocent job
(and its context/process)
Thanks !
Regards,
Christian.
On 26.02.21 at 06:58, Liu, Monk wrote:
[AMD Public Use]
Hi all
The NAVI2X project has hit a really hard-to-solve issue, which
turned out to be a general headache of our TDR mechanism; check
the scenario below:
1. There is a job1 running on the compute1 ring at timestamp
2. There is a job2 running on the gfx ring at timestamp
3. Job1 is the guilty one, and job1/job2 were scheduled
to their rings at almost the same timestamp
4. After 2 seconds we receive two TDR reports, from both
the GFX ring and the compute ring
5. *The current scheme is that in the drm scheduler all the head
jobs of those two rings are considered "bad jobs" and
taken away from the mirror list*
6. The result is that both the real guilty job (job1) and the
innocent job (job2) are deleted from the mirror list,
and their corresponding contexts are also treated as
guilty *(so keeping the innocent process running is not
ensured)*
But ideally the TDR mechanism would detect which ring is the guilty
one, and the innocent ring could resubmit all its pending jobs:
1. Job1 is deleted from the compute1 ring's mirror list
2. Job2 is kept and resubmitted later, and its owning
process/context is not even aware of this TDR at all
Here I have a proposal intended to achieve the above goal; its
rough procedure is:
1. Once any ring reports a TDR, the head job is **not**
treated as a "bad job", and it is **not** deleted from
the mirror list in the drm sched functions
2. In the vendor's function (our amdgpu driver here):
   * reset the GPU
   * repeat the actions below on each ring, *one by one*:
     1. take the head job and submit it on this ring
     2. see if it completes; if not, then this job is the real "bad job"
     3. take it away from the mirror list if this head job is a "bad job"
   * after the above iteration over all rings, we have already
     cleared all the bad job(s)
3. Resubmit all jobs from each mirror list to their
corresponding rings (this is the existing logic)
The idea of this is to use a "serial" way to re-run and re-check
the head job of each ring, in order to take out the real black
sheep and its guilty context.
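
To make the flow concrete, here is a very rough sketch of what step 2
could look like on the amdgpu side; the helpers marked "hypothetical"
below do not exist today, they only illustrate the idea:

/*
 * Hypothetical sketch of the serial per-ring re-check proposed above.
 * reset_whole_gpu(), peek_head_of_mirror_list(), rerun_and_wait() and
 * drop_from_mirror_list() are made-up helpers, not existing amdgpu/drm code.
 */
static void amdgpu_tdr_isolate_black_sheep(struct amdgpu_device *adev)
{
	int i;

	reset_whole_gpu(adev);			/* hypothetical: full ASIC reset */

	for (i = 0; i < adev->num_rings; i++) {
		struct amdgpu_ring *ring = adev->rings[i];
		struct drm_sched_job *head;

		if (!ring || !ring->sched.thread)
			continue;

		head = peek_head_of_mirror_list(&ring->sched);	/* hypothetical */
		if (!head)
			continue;

		/* Re-run only the head job of this ring and wait for it with
		 * a short timeout; if it does not complete, it is the real
		 * bad job.
		 */
		if (!rerun_and_wait(ring, head)) {		/* hypothetical */
			drm_sched_increase_karma(head);
			drop_from_mirror_list(&ring->sched, head); /* hypothetical */
			reset_whole_gpu(adev);	/* clean up before probing the next ring */
		}
	}

	/* Existing logic: resubmit whatever is left on each mirror list. */
	for (i = 0; i < adev->num_rings; i++)
		if (adev->rings[i] && adev->rings[i]->sched.thread)
			drm_sched_resubmit_jobs(&adev->rings[i]->sched);
}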
P.S.: we can use this approach only when the GFX/KCQ rings report a
TDR, since those rings mutually affect each other. For an SDMA ring
timeout it definitely proves that the head job on the SDMA ring is
really guilty.
Thanks
------------------------------------------
Monk Liu | Cloud-GPU Core team
------------------------------------------
_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx