Hi Timur,
On 02.05.23 11:12, Timur Kristóf wrote:
Hi Christian,
Christian König <[email protected]> wrote (on 2023 May 2, Tue 9:59):
On 02.05.23 03:26, André Almeida wrote:
> On 01/05/2023 16:24, Alex Deucher wrote:
>> On Mon, May 1, 2023 at 2:58 PM André Almeida <[email protected]> wrote:
>>>
>>> I know that devcoredump is also used for this kind of information,
>>> but I believe that using an IOCTL is better for interfacing Mesa +
>>> Linux rather than parsing a file whose contents are subject to change.
>>
>> Can you elaborate a bit on that? Isn't the whole point of devcoredump
>> to store this sort of information?
>>
>
> I think that devcoredump is something that you could use to submit to
> a bug report as it is, and then people can read/parse it as they want,
> not as an interface to be read by Mesa... I'm not sure that it's
> something that I would call an API. But I might be wrong; if you know
> something that uses that as an API, please share.
>
> Anyway, relying on that for Mesa would mean that we would need to
> ensure stability for the file content and format, making it less
> flexible to modify in the future and prone to bugs, while the IOCTL is
> well defined and extensible. Maybe the dump from Mesa + devcoredump
> could be complementary information to a bug report.
Neither using an IOCTL nor devcoredump is a good approach for this, since
the values read from the hw registers are completely unreliable. They
might not be available because of GFXOFF, or they could be overwritten or
not even updated by the CP in the first place because of a hang, etc.
If you want to track progress inside an IB, what you do instead is insert
intermediate fence write commands into the IB. E.g. something like: write
value X to location Y when this command executes.
This way you can not only track how far the IB has been processed, but
also which stage of processing we were in when the hang occurred, e.g.
end of pipe, end of shaders, specific shader stages, etc.
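For illustration, a rough sketch of what one such intermediate write could
look like as a PM4 WRITE_DATA packet. The opcode and field encodings are
taken from Mesa's sid.h for GFX9-class parts and should be treated as
illustrative rather than authoritative:

#include <stdint.h>

/* Illustrative PM4 encoding (values as in Mesa's sid.h for GFX9-class parts). */
#define PKT3(op, count)   ((3u << 30) | ((uint32_t)(count) << 16) | ((uint32_t)(op) << 8))
#define PKT3_WRITE_DATA   0x37
#define WD_DST_SEL_MEM    (5u << 8)    /* destination select: memory        */
#define WD_WR_CONFIRM     (1u << 20)   /* wait for the write to land        */
#define WD_ENGINE_ME      (0u << 30)   /* executed by the micro engine (ME) */

/* Append a breadcrumb to the IB: when the CP reaches this packet it writes
 * "marker" to GPU virtual address "va". Returns the number of dwords emitted. */
static unsigned emit_breadcrumb(uint32_t *ib, uint64_t va, uint32_t marker)
{
    unsigned n = 0;
    ib[n++] = PKT3(PKT3_WRITE_DATA, 3);  /* header, count = body dwords - 1   */
    ib[n++] = WD_DST_SEL_MEM | WD_WR_CONFIRM | WD_ENGINE_ME;
    ib[n++] = (uint32_t)va;              /* destination address, low 32 bits  */
    ib[n++] = (uint32_t)(va >> 32);      /* destination address, high 32 bits */
    ib[n++] = marker;                    /* e.g. "end of pipe", "after draw N" */
    return n;
}

After a hang, whichever marker last made it to memory tells you how far
the IB got.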
Currently our biggest challenge in the userspace driver is debugging
"random" GPU hangs. We have many dozens of bug reports from users along
the lines of: "play the game for X hours and it will eventually hang the
GPU". With the currently available tools, it is impossible for us to
tackle these issues. André's proposal would be a step towards improving
this situation.
We already do something like what you suggest, but there are multiple
problems with that approach:
1. We can only submit 1 command buffer at a time because we won't know
which IB hung.
2. We can't use chaining because we don't know where in the IB it hung.
3. It needs userspace to insert (a lot of) extra commands such as
extra synchronization and memory writes.
4. It doesn't work when GPU recovery is enabled because the
information is already gone when we detect the hang.
You can still submit multiple IBs and even chain them. All you need to
do is insert commands into each IB which write the IB being executed and
the position inside the IB to an extra memory location.
The write data command allows writing as many dwords as you want (up to
multiple kb). The only potential problem is when you submit the same IB
multiple times.
And yes, that is of course quite some extra overhead, but I think that
should be manageable.
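As an illustration of what that extra memory location could hold. The
struct and field names are made up, and the packet encoding reuses the
illustrative WRITE_DATA macros from the earlier sketch:

#include <stdint.h>

/* Hypothetical layout of the tracking buffer each IB writes to. */
struct ib_progress {
    uint32_t ib_id;        /* which IB (or chained IB) is executing       */
    uint32_t last_offset;  /* dword offset of the last breadcrumb reached */
    uint32_t submit_seq;   /* would have to be patched per submit; this is
                            * why resubmitting the same IB is awkward     */
};

/* One WRITE_DATA packet can carry all three dwords at once. */
static unsigned emit_progress(uint32_t *ib, uint64_t va,
                              const struct ib_progress *p)
{
    unsigned n = 0;
    ib[n++] = PKT3(PKT3_WRITE_DATA, 2 + 3);  /* 3 data dwords this time */
    ib[n++] = WD_DST_SEL_MEM | WD_WR_CONFIRM | WD_ENGINE_ME;
    ib[n++] = (uint32_t)va;
    ib[n++] = (uint32_t)(va >> 32);
    ib[n++] = p->ib_id;
    ib[n++] = p->last_offset;
    ib[n++] = p->submit_seq;
    return n;
}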
Consequences:
A. It has a huge perf impact, so we can't enable it all the time.
B. Due to the extra synchronization, some issues can't be reproduced
when this kind of debugging is enabled.
C. We have to ask users to disable GPU recovery to collect logs for us.
In my opinion, the correct solution to those problems would be if the
kernel could give userspace the necessary information about a GPU hang
before a GPU reset.
The fundamental problem here is that the kernel doesn't have that
information either. We know which IB timed out and can potentially do a
devcoredump when that happens, but that's it.
The devcoredump can contain register values and memory locations, but
since the ASIC is hung, reading back stuff like the CP or power
management state is dangerous. Alex already noted as well that we would
potentially need to disable GFXOFF to be able to read back register values.
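For reference, a minimal sketch of handing such a snapshot to the
devcoredump framework from a timeout handler. The snapshot contents here
are placeholders; only dev_coredumpv() and its ownership rule are the
real API:

#include <linux/devcoredump.h>
#include <linux/device.h>
#include <linux/vmalloc.h>
#include <linux/string.h>
#include <linux/types.h>
#include <linux/kernel.h>

/* Placeholder hang report; what actually gets dumped (registers, ring
 * state, ...) is the hard part and is limited by GFXOFF/hang state. */
static void example_report_hang(struct device *dev, u32 ring, u64 last_fence)
{
	size_t len = 4096;
	char *buf = vmalloc(len);

	if (!buf)
		return;

	scnprintf(buf, len, "ring: %u\nlast signaled fence: %llu\n",
		  ring, last_fence);

	/* devcoredump takes ownership of the vmalloc'ed buffer and exposes
	 * it under /sys/class/devcoredump/devcdN/data until it expires or
	 * userspace clears it. */
	dev_coredumpv(dev, buf, strlen(buf), GFP_KERNEL);
}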
To avoid the massive performance cost, it would be best if we could
know which IB hung and what commands were being executed when it
hung (perhaps pointers to the VA of the commands), along with which
shaders were in flight (perhaps pointers to the VA of the shader
binaries).
If such an interface could be created, that would mean we could easily
query this information and create useful logs of GPU hangs without
much userspace overhead and without requiring the user to disable GPU
resets etc.
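To make this concrete, a purely hypothetical sketch of what such a query
payload could look like; nothing like this exists in the amdgpu uAPI
today, it is only meant to illustrate the information being asked for:

#include <stdint.h>

/* Hypothetical, for illustration only; not an existing amdgpu interface. */
struct example_hang_info {
	uint64_t ib_gpu_va;         /* VA of the IB that was executing      */
	uint64_t ib_offset_dw;      /* last known position inside that IB   */
	uint64_t shader_gpu_va[8];  /* VAs of the shader binaries in flight */
	uint32_t num_shaders;
	uint32_t ring;              /* which queue/ring timed out           */
	uint64_t context_id;        /* submitting context                   */
	uint64_t timestamp_ns;      /* when the timeout was detected        */
};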
Disabling GPU reset is actually not something you can reliably do either.
If the core memory management runs into an OOM situation, it can choose
to wait for the GPU reset to finish, and when that timeout is infinite
the system just hard hangs.
That's also the reason why you can't wait for userspace to read back
state or something like that.
If it's not possible to do this, we'd appreciate some suggestions on
how to properly solve this without the massive performance cost and
without requiring the user to disable GPU recovery.
The most doable option I can see is that instead of resetting the GPU, we
tell the OS the GPU was hot unplugged and disable system memory access
for the ASIC to prevent random memory corruption.
This way you can investigate the GPU state with tools like umr, and only
after that is done do we send a hot add event and start over from scratch.
The game will obviously crash and need to be restarted, but that is
still better than a full system crash.
Side note: it is also extremely difficult to even determine whether
the problem is in userspace or the kernel. While kernel developers
usually dismiss all GPU hangs as userspace problems, we've seen many
issues where the problem was in the kernel (e.g. bugs where wrong
voltages were set, etc.). Any idea for tackling those kinds of issues
is also welcome.
No idea either. If I had a better idea of how to triage things, my life
would be much simpler as well :)
Regards,
Christian.
Thanks & best regards,
Timur