Hi Kevin,
OpenChrome DDX was sending TTM_PL_FLAG_* based flags to OpenChrome DRM.
Ugh, that would be an absolute no-go for upstreaming.
Is it too much to ask for using more BUG_ON null pointer assertions for TTM
callbacks?
I don't think that this is useful at all. See a BUG_ON() has the same
result as a NULL pointer dereference.
While that may be true, I do plan to implement acceleration later, and this is
why I do not want to settle with the GEM VRAM helpers.
TTM is quite a mess and the effort here is essential to clean it up and
kill most of the driver specific workarounds we have in there.
As the maintainer of all of this I would probably reject any newly added
driver which is using the layer directly instead of the VRAM GEM wrapper.
Regards,
Christian.
Am 19.10.20 um 18:20 schrieb Kevin Brace:
Hi Christian,
I looked into a few more things, and figured out why OpenChrome DRM was not
booting X Server.
Now the situation is under control in my side of the world (OpenChrome
development world), and I got X Server working again with drm-next 5.10 code
and OpenChrome DRM.
Code will be committed into OpenChrome DRM upstream repository's drm-next-5.10
branch in the next few days.
I can see that OpenChrome DRM "just happened to" work without
ttm_tt_create/destroy callbacks being specified.
I can fix this issue easily.
The other issue I was facing was that commit
48e07c23cbeba2a2cda7ca73be0015e727818536 (drm/ttm: nuke memory type flags)
discontinued TTM_PL_FLAG_* flags.
This caused incompatibility between OpenChrome DDX and the updated OpenChrome
DRM that incorporated the changes made with commit 48e07c2.
OpenChrome DDX was sending TTM_PL_FLAG_* based flags to OpenChrome DRM.
I changed OpenChrome DDX code to use TTM_PL_* instead for allocating memory via
TTM, and OpenChrome DRM based Linux kernel was able to boot X Server properly.
Again, the code change to the DDX will happen in a few days.
Before I moved on from this issue, I will like to ask for one request
regarding TTM and its callbacks.
Is it too much to ask for using more BUG_ON null pointer assertions for TTM
callbacks?
Although I did not bring it up with you back then, I did get hit with an issue
similar to this 3 years ago.
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcgit.freedesktop.org%2Fopenchrome%2Fdrm-openchrome%2Fcommit%2F%3Fh%3Ddrm-next-4.13%26id%3D3c08ec601bb1ccd5ff58a9101317b728aa7204bd&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=wmGsqFinBsTLJSgp6G5ygoW7J%2Fppph4CzgO9q7q4h34%3D&reserved=0
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcgit.freedesktop.org%2Fopenchrome%2Fdrm-openchrome%2Fcommit%2F%3Fh%3Ddrm-next-4.13%26id%3D2064175f977d9859831be653df16e3ea10415a8a&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=1Pl3PZ%2FY8P2aFE9in3oYCgl1CW6Nq%2FIQiHe75U8Dp0I%3D&reserved=0
I did not send you an e-mail about it, I do recall complaining to Alex Deucher
about the above io_mem_pfn null pointer bug at XDC2017 (I did attend it, and I
presented about the state of OpenChrome Project.).
I brought this up with Alex because I stuck for more than 2 weeks to figure out
the issue (the commit for overcoming io_mem_pfn callback issue was made on
2017-09-13, but the commit prior to that was on 2017-08-28).
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcgit.freedesktop.org%2Fopenchrome%2Fdrm-openchrome%2Flog%2F%3Fh%3Ddrm-next-4.13%26ofs%3D50&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=zsLWqjAPDFhokHptCAHl4XRCZJSjcWsnvAimoUyb7Mk%3D&reserved=0
Anyway, someone else got hit by the same issue a month or two later
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Farchives%2Fdri-devel%2F2017-November%2F159002.html&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=wbkUKUL0XVOviw%2Bvt0WFGvSyJiOF%2Bdz%2FZVSMSPTwuwI%3D&reserved=0
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Farchives%2Fdri-devel%2F2017-December%2F161168.html&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=nJ8n0lFRTX%2FoxPsKCvV%2FYvtUvmvzxLGVikUWtBsZy5U%3D&reserved=0
These are the commits that fixed the issue.
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcgit.freedesktop.org%2Fdrm%2Fdrm%2Fcommit%2F%3Fid%3Dea642c3216cb2a60d1c0e760ae47ee85c9c16447&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=aL1C28MKl5SPRsP%2F9A6XWFMmtXmxymmGic3vLZe5%2FNw%3D&reserved=0
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcgit.freedesktop.org%2Fdrm%2Fdrm%2Fcommit%2F%3Fid%3Dc67fa6edc8b11afe22c88a23963170bf5f151acf&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=%2BY1z8sgjIf%2FPiJG9PBIOnsHuRCFq1uHNSVJodWQz%2B5k%3D&reserved=0
If you compare the commit date, I figured it out io_mem_pfn null pointer bug 2
to 3 months earlier.
I think the problem we got with the latest code change to the TTM is, many TTM
callbacks do not check for a null pointer for mandatory callbacks.
Also, the ttm_bo_driver struct comment section inside ttm_bo_driver.h does not
make it 100% clear that what is mandatory and what is not.
If these were done, I am sure I would have left ttm_tt_create/destroy callbacks
intact.
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcgit.freedesktop.org%2Fopenchrome%2Fdrm-openchrome%2Fcommit%2F%3Fh%3Ddrm-next-5.10%26id%3D64947142a1cf8b83faa73da7aa35a17f6a24568a&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=egysIQ%2BgxKi3ujKmzyRNRvI93SY%2BFI5dlW%2FvyGglAn8%3D&reserved=0
(scroll down to openchrome/openchrome_ttm.c)
Regarding your suggestion that I should abandon OpenChrome DRM's current
GEM / TTM implementation and instead I should use the GEM VRAM helpers, since I
was able to figure out the issues with some suggestions from you and Dave, I do
not think it is worth switching to GEM VRAM helpers at this point.
Obviously, the current implementation does not support acceleration, hence, it
appears to be a good candidate for switching to GEM VRAM helpers.
While that may be true, I do plan to implement acceleration later, and this is
why I do not want to settle with the GEM VRAM helpers.
Regards,
Kevin Brace
Brace Computer Laboratory blog
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbracecomputerlab.com%2F&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=a11wNdmtVlgNBWHuBnxT%2BhCZRC%2BCtZw2xrVebNylSLU%3D&reserved=0
Sent: Monday, October 19, 2020 at 3:13 AM
From: "Christian König" <[email protected]>
To: "Kevin Brace" <[email protected]>, "Dave Airlie" <[email protected]>,
"dri-devel" <[email protected]>
Subject: Re: It appears drm-next TTM cleanup broke something . . .
Hi Kevin,
the basic problem you are facing is that ttm_tt_create/destroy is
mandatory (It always was). You need an implementation or otherwise you
won't be able to use the system domain (additional to the optional GTT
domain).
My best guess is that the difference is that we now force to initiate
the system domain for all drivers.
If that is correct you just that you never ran into because you never
correctly initialized TTM to support buffer moves.
I'm not sure what exactly the OpenChrome DRM driver is doing, but I
strongly suggest to just drop TTM support completely and use the GEM
VRAM helper layer instead.
Regards,
Christian.
Am 19.10.20 um 09:23 schrieb Kevin Brace:
Hi Dave,
Yeah, with the workaround I mentioned in my previous e-mail, OpenChrome DRM does not
crash for "ttm_tt_create" member being null.
It is still not able to boot X Server due to some other TTM related memory
allocation issue it is suffering from.
I think making huge changes to TTM during this development cycle broke
OpenChrome DRM.
Following up on the question I raised during the previous e-mail.
Shouldn't "use_tt" parameter being "false" for ttm_range_man_init() disable TTM
TT functionality?
I feel like that should be the expected behavior.
Again, there is only 5 to 6 more days left until Linux 5.10-rc2, so I decided
to contact you on Sunday (I consider this bug to be urgent.).
Assuming what I am asserting is correct, I think the reason why this was not
discovered earlier was due to the following reasons.
1) nouveau, radeon, and amdgpu already use TTM TT functionality.
2) ast uses GEM VRAM helper that internally uses TTM. It populates "ttm_tt_create" and
"ttm_tt_destroy" members, hence, the developers did not notice the breakage.
3) OpenChrome DRM is still not in the mainline tree, so no one other than
myself noticed the problem until now.
Regarding the TTM TT functionality, OpenChrome DRM currently does not support acceleration, hence,
I did not believe it was necessary to populate "ttm_tt_create" and
"ttm_tt_destroy" members.
That implementation worked fine until the previous development cycle code.
Of course, I will eventually add support for acceleration, hence, TTM TT
functionality will be utilized at some point.
Regards,
Kevin Brace
Brace Computer Laboratory blog
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbracecomputerlab.com%2F&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=a11wNdmtVlgNBWHuBnxT%2BhCZRC%2BCtZw2xrVebNylSLU%3D&reserved=0
Sent: Sunday, October 18, 2020 at 12:50 PM
From: "Dave Airlie" <[email protected]>
To: "Kevin Brace" <[email protected]>, "Christian König"
<[email protected]>
Cc: "dri-devel" <[email protected]>, "Dave Airlie"
<[email protected]>
Subject: Re: It appears drm-next TTM cleanup broke something . . .
On Mon, 19 Oct 2020 at 05:15, Kevin Brace <[email protected]> wrote:
Hi Dave,
It is a little urgent, so I am writing this right now.
As usual, I pulled in DRM repository code for an out of tree OpenChrome DRM
repository a few days ago.
While going through the changes I need to make to OpenChrome DRM to compile
with the latest Linux kernel, I noticed that ttm_bo_init_mm() was discontinued,
and it was replaced with ttm_range_man_init().
ttm_range_man_init() has a parameter called "bool use_tt", but honestly, I do
not think it is functioning correctly.
If I keep "ttm_tt_create" member of ttm_bo_driver struct null by not specifying
it, TTM still tries to call it, and crashes due to a null pointer access.
The workaround I found so far is to specify the "ttm_tt_create" member by
copying bo_driver_ttm_tt_create() from drm/drm_gem_vram_helper.c.
This is what the call trace looks like without specifying the "ttm_tt_create"
member (i.e., this member is null).
cc'ing Christian,
I can't remember if we did this deliberately or if just worked by
accident previously.
Either way, you should probably need a ttm_tt_create going forward.
Dave.
_______________________________________________
. . .
kernel: [ 34.310674] [drm:openchrome_bo_create [openchrome]] Entered
openchrome_bo_create.
kernel: [ 34.310697] [drm:openchrome_ttm_domain_to_placement [openchrome]]
Entered openchrome_ttm_domain_to_placement.
kernel: [ 34.310706] [drm:openchrome_ttm_domain_to_placement [openchrome]]
Exiting openchrome_ttm_domain_to_placement.
kernel: [ 34.310737] BUG: kernel NULL pointer dereference, address:
0000000000000000
kernel: [ 34.310742] #PF: supervisor instruction fetch in kernel mode
kernel: [ 34.310745] #PF: error_code(0x0010) - not-present page
. . .
kernel: [ 34.310807] Call Trace:
kernel: [ 34.310827] ttm_tt_create+0x5f/0xa0 [ttm]
kernel: [ 34.310839] ttm_bo_validate+0xb8/0x140 [ttm]
kernel: [ 34.310886] ? drm_vma_offset_add+0x56/0x70 [drm]
kernel: [ 34.310897] ? openchrome_gem_create_ioctl+0x150/0x150 [openchrome]
. . .
_______________________________________________
The erroneous call to "ttm_tt_create" member happens right after TTM placement
is performed (openchrome_ttm_domain_to_placement()).
Currently, OpenChrome DRM's TTM implementation does not use "ttm_tt_create"
member, and this arrangement worked fine until Linux 5.9's drm-next code.
It appears that Linux 5.10's drm-next code broke the code.
Regards,
Kevin Brace
Brace Computer Laboratory blog
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbracecomputerlab.com%2F&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462975986&sdata=oQr07H953zWryuBQJyPgFsqtOkXAtfprsxiA4ny%2F4LE%3D&reserved=0
_______________________________________________
dri-devel mailing list
[email protected]
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fdri-devel&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462975986&sdata=gq2l57u0hZD2arzcWU2ppTuA0mgcshZHI%2BVvHYJ8hcs%3D&reserved=0
_______________________________________________
dri-devel mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/dri-devel