Am 2020-05-05 um 1:29 p.m. schrieb Felix Kuehling: > Am 2020-05-05 um 11:19 a.m. schrieb Christian König: >> Am 05.05.20 um 16:58 schrieb Felix Kuehling: >>> Am 2020-05-05 um 3:47 a.m. schrieb Christian König: >>>> Just to reply here once more, this patch is a clear NAK. >>> >>> Agreed. But see below. I don't think all is well. >>> >>>> >>>> The references are grabbed in the call path of >>>> drm_gem_prime_export() and dropped again in drm_gem_dmabuf_release(). >>>> >>>> So they are perfectly balanced as far as I can see. >>> >>> That is true for the GEM object references. But I believe there is >>> still a problem with the TTM BO references. >>> >>> As far as I can tell amdgpu_bo_unref can free the TTM BO while there >>> are still references to the GEM object from DMA buf exports. I think >>> that's a fundamental problem with how we have two reference counts >>> for the same physical object (TTM BO and the embedded GEM BO). >>> >> >> Completely agree, I also mentioned that problem during my talk on >> FOSDEM. But calling amdgpu_bo_unref() to often is a bug in itself. > > That's not the problem here. > I think we may have a fundamental issue with how we release memory in KFD using amdgpu_bo_unref. That worked OK before the GEM object was embedded in the TTM BO, and it continued working for memory that's not shared via DMABufs.
Alejandro is testing a patch that changes KFD to release its BOs with drm_gem_object_put_unlocked instead of amdgpu_bo_unref. Ignore my musings below. I think they were based on incorrect assumptions that it's OK to release GEM BOs with amdgpu_bo_unref. Regards, Felix > >> >> What we could probably do to detect this is adding a BUG_ON() in TTMs >> release function and checking if the GEM reference count is really dead. > > The problem is, that we have to guess whether there are still any > dmabuf references to the GEM BO. There is no way amdgpu can know that. > You can't make amdgpu responsible for keeping a reference to the TTM > BO while the GEM BO is still referenced by entities completely out of > the control of amdgpu. > > Another weird thing I see is that amdgpu_gem_free_object calls > amdgpu_bo_unref. That implies that the GEM object conceptually holds a > reference to the amdpu/TTM BO. But that is not really the case. Amdgpu > never takes that reference that GEM is supposed to own. If it did, we > would leak all our memory because nobody would ever drop that reference. > >> >>> I think the correct solution is for amdgpu_bo_ref/unref to delegate >>> its reference counting to drm_gem_object_get/put instead of >>> ttm_bo_get/put. The amdgpu BO would hold one token reference to the >>> TTM BO, which it can drop when the GEM BO refcount drops to 0. >>> Finally, the amdgpu BO should only be freed once the TTM BO refcount >>> also becomes 0. >> >> Just the other way around, but yes the long term plan should probably >> be to merge the two. > > I need a short term solution. Because I have a bug that causes a > kernel oops with applications that are valid and correct, as far as I > can tell. > > I'm thinking a solution that doesn't require major changes to the way > TTM and GEM interact would put amdgpu in charge of coordinating the > two. Unfortunately that would mean adding a third reference count in > amdgpu_bo, in addition to the ones in TTM and GEM. The amdgpu BO would > hold one token reference to each of the GEM and TTM BO. When amdgpu > refcount goes to 0 it releases that GEM BO token reference. When the > GEM BO refcount goes to 0, we get a callback amdgpu_gem_object_free. > There we can drop the token reference to the TTM BO. Once the TTM BO > reference goes to 0 we free the memory. > > Does this sound feasible? > > Regards, > Felix > > >> >> The difficult is currently we have a mismatch what locks could be >> taken when we drop the references. >> >> Regards, >> Christian. >> >>> Regards, >>> Felix >>> >>> >>>> >>>> Regards, >>>> Christian. >>>> >>>> Am 01.05.20 um 16:44 schrieb Felix Kuehling: >>>>> >>>>> [dropping my gmail address] >>>>> >>>>> We saw this backtrace showing the call chain while investigating a >>>>> kernel oops caused by this issue on the DKMS branch with the KFD >>>>> IPC API. It happens after a dma-buf file is released with fput: >>>>> >>>>> [ 1255.049330] BUG: kernel NULL pointer dereference, address: >>>>> 000000000000051e >>>>> [ 1255.049727] #PF: supervisor read access in kernel mode >>>>> [ 1255.050092] #PF: error_code(0x0000) - not-present page >>>>> [ 1255.050416] PGD 0 P4D 0 >>>>> [ 1255.050736] Oops: 0000 [#1] SMP PTI >>>>> [ 1255.051060] CPU: 27 PID: 2292 Comm: kworker/27:2 Tainted: G >>>>> OE 5.3.0-46-generic #38~18.04.1-Ubuntu >>>>> [ 1255.051400] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, >>>>> BIOS 3.0a 02/26/2019 >>>>> [ 1255.051752] Workqueue: events delayed_fput >>>>> [ 1255.052111] RIP: 0010:drm_gem_object_put_unlocked+0x1c/0x70 [drm] >>>>> [ 1255.052465] Code: 4d 80 c8 ee 0f 0b eb d8 66 0f 1f 44 00 00 0f 1f 44 >>>>> 00 00 48 85 ff 74 34 55 48 89 e5 41 54 53 48 89 fb 48 8b 7f 08 48 8b 47 >>>>> 20 <48> 83 b8 a0 00 00 00 00 74 1a 4c 8d 67 68 48 89 df 4c 89 e6 e8 9b >>>>> [ 1255.053224] RSP: 0018:ffffb4b62035fdc8 EFLAGS: 00010286 >>>>> [ 1255.053613] RAX: 000000000000047e RBX: ffff9f2add197850 RCX: >>>>> 0000000000000000 >>>>> [ 1255.054032] RDX: 0000000000000000 RSI: 0000000000000000 RDI: >>>>> ffff9f2aa2548aa0 >>>>> [ 1255.054440] RBP: ffffb4b62035fdd8 R08: 0000000000000000 R09: >>>>> 0000000000000000 >>>>> [ 1255.054860] R10: 0000000000000010 R11: ffff9f2a4b1cc310 R12: >>>>> 0000000000080005 >>>>> [ 1255.055268] R13: ffff9f2a4b1cc310 R14: ffff9f4e369161e0 R15: >>>>> ffff9f2a1b2f9080 >>>>> [ 1255.055674] FS: 0000000000000000(0000) GS:ffff9f4e3f740000(0000) >>>>> knlGS:0000000000000000 >>>>> [ 1255.056087] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>>> [ 1255.056501] CR2: 000000000000051e CR3: 00000002df00a004 CR4: >>>>> 00000000007606e0 >>>>> [ 1255.056923] DR0: 0000000000000000 DR1: 0000000000000000 DR2: >>>>> 0000000000000000 >>>>> [ 1255.057345] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: >>>>> 0000000000000400 >>>>> [ 1255.057763] PKRU: 55555554 >>>>> [ 1255.058179] Call Trace: >>>>> [ 1255.058603] drm_gem_dmabuf_release+0x1a/0x30 [drm] >>>>> [ 1255.059025] dma_buf_release+0x56/0x130 >>>>> [ 1255.059443] __fput+0xc6/0x260 >>>>> [ 1255.059856] delayed_fput+0x20/0x30 >>>>> [ 1255.060272] process_one_work+0x1fd/0x3f0 >>>>> [ 1255.060686] worker_thread+0x34/0x410 >>>>> [ 1255.061099] kthread+0x121/0x140 >>>>> [ 1255.061510] ? process_one_work+0x3f0/0x3f0 >>>>> [ 1255.061923] ? kthread_park+0xb0/0xb0 >>>>> [ 1255.062336] ret_from_fork+0x35/0x40 >>>>> >>>>> drm_gem_object_put_unlocked calls drm_gem_object_free when the >>>>> obj->refcount reaches 0. From there it calls >>>>> dev->driver->gem_free_object_unlocked, which is >>>>> amdgpu_gem_object_free in amdgpu. >>>>> >>>>> Regards, >>>>> Felix >>>>> >>>>> Am 2020-05-01 um 10:29 a.m. schrieb Christian König: >>>>>> Am 01.05.20 um 16:21 schrieb Felix Kuehling: >>>>>>> From: Felix Kuehling <[email protected]> >>>>>>> >>>>>>> That reference gets dropped when the the dma-buf is freed. Not >>>>>>> incrementing >>>>>>> the refcount can lead to use-after-free errors. >>>>>>> >>>>>>> Signed-off-by: Felix Kuehling <[email protected]> >>>>>>> --- >>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 9 ++++++++- >>>>>>> 1 file changed, 8 insertions(+), 1 deletion(-) >>>>>>> >>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c >>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c >>>>>>> index ffeb20f11c07..a0f9b3ef4aad 100644 >>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c >>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c >>>>>>> @@ -398,8 +398,15 @@ struct dma_buf >>>>>>> *amdgpu_gem_prime_export(struct drm_gem_object *gobj, >>>>>>> return ERR_PTR(-EPERM); >>>>>>> buf = drm_gem_prime_export(gobj, flags); >>>>>>> - if (!IS_ERR(buf)) >>>>>>> + if (!IS_ERR(buf)) { >>>>>>> buf->ops = &amdgpu_dmabuf_ops; >>>>>>> + /* GEM needs a reference to the underlying object >>>>>>> + * that gets dropped when the dma-buf is released, >>>>>>> + * through the amdgpu_gem_object_free callback >>>>>>> + * from drm_gem_object_put_unlocked. >>>>>>> + */ >>>>>>> + amdgpu_bo_ref(bo); >>>>>>> + } >>>>>> >>>>>> Of hand that doesn't sounds correct to me. Why should the >>>>>> exported bo be closed through amdgpu_gem_object_free()? >>>>>> >>>>>> Regards, >>>>>> Christian. >>>>>> >>>>>>> return buf; >>>>>>> } >>>>>> >>>>> >>>>> _______________________________________________ >>>>> amd-gfx mailing list >>>>> [email protected] >>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx >>>> >>
_______________________________________________ amd-gfx mailing list [email protected] https://lists.freedesktop.org/mailman/listinfo/amd-gfx
