amdgpu: Take a reference to an exported BO

Felix Kuehling Tue, 05 May 2020 13:10:54 -0700

Am 2020-05-05 um 1:29 p.m. schrieb Felix Kuehling:
> Am 2020-05-05 um 11:19 a.m. schrieb Christian König:
>> Am 05.05.20 um 16:58 schrieb Felix Kuehling:
>>> Am 2020-05-05 um 3:47 a.m. schrieb Christian König:
>>>> Just to reply here once more, this patch is a clear NAK.
>>>
>>> Agreed. But see below. I don't think all is well.
>>>
>>>>
>>>> The references are grabbed in the call path of
>>>> drm_gem_prime_export() and dropped again in drm_gem_dmabuf_release().
>>>>
>>>> So they are perfectly balanced as far as I can see.
>>>
>>> That is true for the GEM object references. But I believe there is
>>> still a problem with the TTM BO references.
>>>
>>> As far as I can tell amdgpu_bo_unref can free the TTM BO while there
>>> are still references to the GEM object from DMA buf exports. I think
>>> that's a fundamental problem with how we have two reference counts
>>> for the same physical object (TTM BO and the embedded GEM BO).
>>>
>>
>> Completely agree, I also mentioned that problem during my talk on
>> FOSDEM. But calling amdgpu_bo_unref() to often is a bug in itself.
>
> That's not the problem here.
>
I think we may have a fundamental issue with how we release memory in
KFD using amdgpu_bo_unref. That worked OK before the GEM object was
embedded in the TTM BO, and it continued working for memory that's not
shared via DMABufs.


Alejandro is testing a patch that changes KFD to release its BOs with
drm_gem_object_put_unlocked instead of amdgpu_bo_unref.

Ignore my musings below. I think they were based on incorrect
assumptions that it's OK to release GEM BOs with amdgpu_bo_unref.

Regards,
  Felix


>
>>
>> What we could probably do to detect this is adding a BUG_ON() in TTMs
>> release function and checking if the GEM reference count is really dead.
>
> The problem is, that we have to guess whether there are still any
> dmabuf references to the GEM BO. There is no way amdgpu can know that.
> You can't make amdgpu responsible for keeping a reference to the TTM
> BO while the GEM BO is still referenced by entities completely out of
> the control of amdgpu.
>
> Another weird thing I see is that amdgpu_gem_free_object calls
> amdgpu_bo_unref. That implies that the GEM object conceptually holds a
> reference to the amdpu/TTM BO. But that is not really the case. Amdgpu
> never takes that reference that GEM is supposed to own. If it did, we
> would leak all our memory because nobody would ever drop that reference.
>
>>
>>> I think the correct solution is for amdgpu_bo_ref/unref to delegate
>>> its reference counting to drm_gem_object_get/put instead of
>>> ttm_bo_get/put. The amdgpu BO would hold one token reference to the
>>> TTM BO, which it can drop when the GEM BO refcount drops to 0.
>>> Finally, the amdgpu BO should only be freed once the TTM BO refcount
>>> also becomes 0.
>>
>> Just the other way around, but yes the long term plan should probably
>> be to merge the two.
>
> I need a short term solution. Because I have a bug that causes a
> kernel oops with applications that are valid and correct, as far as I
> can tell.
>
> I'm thinking a solution that doesn't require major changes to the way
> TTM and GEM interact would put amdgpu in charge of coordinating the
> two. Unfortunately that would mean adding a third reference count in
> amdgpu_bo, in addition to the ones in TTM and GEM. The amdgpu BO would
> hold one token reference to each of the GEM and TTM BO. When amdgpu
> refcount goes to 0 it releases that GEM BO token reference. When the
> GEM BO refcount goes  to 0, we get a callback amdgpu_gem_object_free.
> There we can drop the token reference to the TTM BO. Once the TTM BO
> reference goes to 0 we free the memory.
>
> Does this sound feasible?
>
> Regards,
>   Felix
>
>
>>
>> The difficult is currently we have a mismatch what locks could be
>> taken when we drop the references.
>>
>> Regards,
>> Christian.
>>
>>> Regards,
>>>   Felix
>>>
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 01.05.20 um 16:44 schrieb Felix Kuehling:
>>>>>
>>>>> [dropping my gmail address]
>>>>>
>>>>> We saw this backtrace showing the call chain while investigating a
>>>>> kernel oops caused by this issue on the DKMS branch with the KFD
>>>>> IPC API. It happens after a dma-buf file is released with fput:
>>>>>
>>>>> [ 1255.049330] BUG: kernel NULL pointer dereference, address: 
>>>>> 000000000000051e
>>>>> [ 1255.049727] #PF: supervisor read access in kernel mode
>>>>> [ 1255.050092] #PF: error_code(0x0000) - not-present page
>>>>> [ 1255.050416] PGD 0 P4D 0
>>>>> [ 1255.050736] Oops: 0000 [#1] SMP PTI
>>>>> [ 1255.051060] CPU: 27 PID: 2292 Comm: kworker/27:2 Tainted: G           
>>>>> OE     5.3.0-46-generic #38~18.04.1-Ubuntu
>>>>> [ 1255.051400] Hardware name: Supermicro SYS-4029GP-TRT2/X11DPG-OT-CPU, 
>>>>> BIOS 3.0a 02/26/2019
>>>>> [ 1255.051752] Workqueue: events delayed_fput
>>>>> [ 1255.052111] RIP: 0010:drm_gem_object_put_unlocked+0x1c/0x70 [drm]
>>>>> [ 1255.052465] Code: 4d 80 c8 ee 0f 0b eb d8 66 0f 1f 44 00 00 0f 1f 44 
>>>>> 00 00 48 85 ff 74 34 55 48 89 e5 41 54 53 48 89 fb 48 8b 7f 08 48 8b 47 
>>>>> 20 <48> 83 b8 a0 00 00 00 00 74 1a 4c 8d 67 68 48 89 df 4c 89 e6 e8 9b
>>>>> [ 1255.053224] RSP: 0018:ffffb4b62035fdc8 EFLAGS: 00010286
>>>>> [ 1255.053613] RAX: 000000000000047e RBX: ffff9f2add197850 RCX: 
>>>>> 0000000000000000
>>>>> [ 1255.054032] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
>>>>> ffff9f2aa2548aa0
>>>>> [ 1255.054440] RBP: ffffb4b62035fdd8 R08: 0000000000000000 R09: 
>>>>> 0000000000000000
>>>>> [ 1255.054860] R10: 0000000000000010 R11: ffff9f2a4b1cc310 R12: 
>>>>> 0000000000080005
>>>>> [ 1255.055268] R13: ffff9f2a4b1cc310 R14: ffff9f4e369161e0 R15: 
>>>>> ffff9f2a1b2f9080
>>>>> [ 1255.055674] FS:  0000000000000000(0000) GS:ffff9f4e3f740000(0000) 
>>>>> knlGS:0000000000000000
>>>>> [ 1255.056087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>> [ 1255.056501] CR2: 000000000000051e CR3: 00000002df00a004 CR4: 
>>>>> 00000000007606e0
>>>>> [ 1255.056923] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
>>>>> 0000000000000000
>>>>> [ 1255.057345] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
>>>>> 0000000000000400
>>>>> [ 1255.057763] PKRU: 55555554
>>>>> [ 1255.058179] Call Trace:
>>>>> [ 1255.058603]  drm_gem_dmabuf_release+0x1a/0x30 [drm]
>>>>> [ 1255.059025]  dma_buf_release+0x56/0x130
>>>>> [ 1255.059443]  __fput+0xc6/0x260
>>>>> [ 1255.059856]  delayed_fput+0x20/0x30
>>>>> [ 1255.060272]  process_one_work+0x1fd/0x3f0
>>>>> [ 1255.060686]  worker_thread+0x34/0x410
>>>>> [ 1255.061099]  kthread+0x121/0x140
>>>>> [ 1255.061510]  ? process_one_work+0x3f0/0x3f0
>>>>> [ 1255.061923]  ? kthread_park+0xb0/0xb0
>>>>> [ 1255.062336]  ret_from_fork+0x35/0x40
>>>>>
>>>>> drm_gem_object_put_unlocked calls drm_gem_object_free when the
>>>>> obj->refcount reaches 0. From there it calls
>>>>> dev->driver->gem_free_object_unlocked, which is
>>>>> amdgpu_gem_object_free in amdgpu.
>>>>>
>>>>> Regards,
>>>>>   Felix
>>>>>
>>>>> Am 2020-05-01 um 10:29 a.m. schrieb Christian König:
>>>>>> Am 01.05.20 um 16:21 schrieb Felix Kuehling:
>>>>>>> From: Felix Kuehling <[email protected]>
>>>>>>>
>>>>>>> That reference gets dropped when the the dma-buf is freed. Not
>>>>>>> incrementing
>>>>>>> the refcount can lead to use-after-free errors.
>>>>>>>
>>>>>>> Signed-off-by: Felix Kuehling <[email protected]>
>>>>>>> ---
>>>>>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c | 9 ++++++++-
>>>>>>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>>>>> index ffeb20f11c07..a0f9b3ef4aad 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c
>>>>>>> @@ -398,8 +398,15 @@ struct dma_buf
>>>>>>> *amdgpu_gem_prime_export(struct drm_gem_object *gobj,
>>>>>>>           return ERR_PTR(-EPERM);
>>>>>>>         buf = drm_gem_prime_export(gobj, flags);
>>>>>>> -    if (!IS_ERR(buf))
>>>>>>> +    if (!IS_ERR(buf)) {
>>>>>>>           buf->ops = &amdgpu_dmabuf_ops;
>>>>>>> +        /* GEM needs a reference to the underlying object
>>>>>>> +         * that gets dropped when the dma-buf is released,
>>>>>>> +         * through the amdgpu_gem_object_free callback
>>>>>>> +         * from drm_gem_object_put_unlocked.
>>>>>>> +         */
>>>>>>> +        amdgpu_bo_ref(bo);
>>>>>>> +    }
>>>>>>
>>>>>> Of hand that doesn't sounds correct to me. Why should the
>>>>>> exported bo be closed through amdgpu_gem_object_free()?
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>         return buf;
>>>>>>>   }
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> amd-gfx mailing list
>>>>> [email protected]
>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>
>>

_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: [PATCH 1/1] drm/amdgpu: Take a reference to an exported BO

Reply via email to