> 2025年1月4日 01:33,Chen, Xiaogang <[email protected]> 写道:
>
>
>
> On 1/3/2025 1:05 AM, Gerry Liu wrote:
>>
>>
>>> 2025年1月3日 14:19,Chen, Xiaogang <[email protected]
>>> <mailto:[email protected]>> 写道:
>>>
>>>
>>>
>>> On 1/2/2025 11:55 PM, Gerry Liu wrote:
>>>>
>>>>
>>>>> 2025年1月3日 13:44,Chen, Xiaogang <[email protected]
>>>>> <mailto:[email protected]>> 写道:
>>>>>
>>>>>
>>>>>
>>>>> On 1/2/2025 8:22 PM, Gerry Liu wrote:
>>>>>>
>>>>>>
>>>>>>> 2025年1月3日 07:08,Chen, Xiaogang <[email protected]
>>>>>>> <mailto:[email protected]>> 写道:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 1/1/2025 11:36 PM, Jiang Liu wrote:
>>>>>>>> On error recover path during device probe, it may trigger invalid
>>>>>>>> memory access as below:
>>>>>>>> 024-12-25 12:00:53 [ 2703.773040] general protection fault, probably
>>>>>>>> for non-canonical address 0x52445f4749464e4f: 0000 [#1] SMP NOPTI
>>>>>>>> 2024-12-25 12:00:53 [ 2703.785199] CPU: 157 PID: 151951 Comm: rmmod
>>>>>>>> Kdump: loaded Tainted: G W OE 5.10.134-17.2.al8.x86_64 #1
>>>>>>>> 2024-12-25 12:00:53 [ 2703.797515] Hardware name: Alibaba Alibaba
>>>>>>>> Cloud ECS/Alibaba Cloud ECS, BIOS 3.0.ES.AL.P.087.05 04/07/2024
>>>>>>>> 2024-12-25 12:00:53 [ 2703.809188] RIP:
>>>>>>>> 0010:kgd2kfd_device_exit+0x6/0x60 [amdgpu]
>>>>>>>> 2024-12-25 12:00:53 [ 2703.816136] Code: ff 48 c7 83 38 01 00 00 80 45
>>>>>>>> e4 c0 c7 83 40 01 00 00 08 0f 00 00 e9 cd fa ff ff 66 0f 1f 84 00 00
>>>>>>>> 00 00 00 0f
>>>>>>>> 1f 44 00 00 55 <80> bf 28 01 00 00 00 48 89 fd 75 09 48 89 ef 5d e9 65
>>>>>>>> df 9d f4 8b
>>>>>>>> 2024-12-25 12:00:54 [ 2703.838622] RSP: 0018:ffffb5313df07e10 EFLAGS:
>>>>>>>> 00010202
>>>>>>>> 2024-12-25 12:00:54 [ 2703.845216] RAX: 0000000000000000 RBX:
>>>>>>>> ffff97ad689a3ff0 RCX: 0000000080400014
>>>>>>>> 2024-12-25 12:00:54 [ 2703.853935] RDX: 0000000080400015 RSI:
>>>>>>>> ffff97ad627e93d8 RDI: 52445f4749464e4f
>>>>>>>> 2024-12-25 12:00:54 [ 2703.862652] RBP: ffff97ad689a3ff0 R08:
>>>>>>>> 0000000000000000 R09: ffffffffb5814c00
>>>>>>>> 2024-12-25 12:00:54 [ 2703.871368] R10: ffff97ad627e9280 R11:
>>>>>>>> 0000000000000001 R12: ffffb5313df07e98
>>>>>>>> 2024-12-25 12:00:54 [ 2703.880068] R13: ffff97ad689a1810 R14:
>>>>>>>> 0000000000000001 R15: 0000000000000000
>>>>>>>> 2024-12-25 12:00:54 [ 2703.888768] FS: 00007fa4db81e740(0000)
>>>>>>>> GS:ffff98a93ec80000(0000) knlGS:0000000000000000
>>>>>>>> 2024-12-25 12:00:54 [ 2703.898547] CS: 0010 DS: 0000 ES: 0000 CR0:
>>>>>>>> 0000000080050033
>>>>>>>> 2024-12-25 12:00:54 [ 2703.905684] CR2: 00007f4502deca00 CR3:
>>>>>>>> 000001008fc50001 CR4: 0000000002770ee0
>>>>>>>> 2024-12-25 12:00:54 [ 2703.914382] DR0: 0000000000000000 DR1:
>>>>>>>> 0000000000000000 DR2: 0000000000000000
>>>>>>>> 2024-12-25 12:00:54 [ 2703.923066] DR3: 0000000000000000 DR6:
>>>>>>>> 00000000fffe07f0 DR7: 0000000000000400
>>>>>>>> 2024-12-25 12:00:54 [ 2703.931746] PKRU: 55555554
>>>>>>>> 2024-12-25 12:00:54 [ 2703.935444] Call Trace:
>>>>>>>> 2024-12-25 12:00:54 [ 2703.938962]
>>>>>>>> amdgpu_amdkfd_device_fini_sw+0x1a/0x40 [amdgpu]
>>>>>>>> 2024-12-25 12:00:54 [ 2703.946080]
>>>>>>>> amdgpu_device_ip_fini.isra.0+0x3d/0x1b0 [amdgpu]
>>>>>>>> 2024-12-25 12:00:54 [ 2703.953278] amdgpu_device_fini_sw+0x2a/0x240
>>>>>>>> [amdgpu]
>>>>>>>> 2024-12-25 12:00:54 [ 2703.959789]
>>>>>>>> amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
>>>>>>>> 2024-12-25 12:00:54 [ 2703.966501]
>>>>>>>> devm_drm_dev_init_release+0x42/0x70 [drm]
>>>>>>>> 2024-12-25 12:00:54 [ 2703.972891] release_nodes+0x6e/0xb0
>>>>>>>> 2024-12-25 12:00:54 [ 2703.977522] amdgpu_xcp_drv_release+0x38/0x80
>>>>>>>> [amdxcp]
>>>>>>>> 2024-12-25 12:00:54 [ 2703.983906]
>>>>>>>> __do_sys_delete_module.constprop.0+0x138/0x2a0
>>>>>>>> 2024-12-25 12:00:54 [ 2703.990775] ? exit_to_user_mode_loop+0xd6/0x120
>>>>>>>> 2024-12-25 12:00:54 [ 2703.996563] do_syscall_64+0x2e/0x50
>>>>>>>> 2024-12-25 12:00:54 [ 2704.001166]
>>>>>>>> entry_SYSCALL_64_after_hwframe+0x62/0xc7
>>>>>>>> 2024-12-25 12:00:54 [ 2704.007432] RIP: 0033:0x7fa4db2620cb
>>>>>>>> 2024-12-25 12:00:54 [ 2704.012025] Code: 73 01 c3 48 8b 0d a5 6d 19 00
>>>>>>>> f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f
>>>>>>>> 1e fa b8 b0
>>>>>>>> 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 6d 19 00 f7 d8
>>>>>>>> 64 89 01 48
>>>>>>>>
>>>>>>>> Signed-off-by: Jiang Liu <[email protected]>
>>>>>>>> <mailto:[email protected]>
>>>>>>>> ---
>>>>>>>> drivers/gpu/drm/amd/amdkfd/kfd_device.c | 7 ++++++-
>>>>>>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>> b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>> index b6c5ffd4630b..58c1b5fcf785 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_device.c
>>>>>>>> @@ -663,6 +663,8 @@ static void kfd_cleanup_nodes(struct kfd_dev *kfd,
>>>>>>>> unsigned int num_nodes)
>>>>>>>>
>>>>>>>> for (i = 0; i < num_nodes; i++) {
>>>>>>>> knode = kfd->nodes[i];
>>>>>>>> + if (!knode)
>>>>>>>> + continue;
>>>>>>> I wonder how knode can be null here? kfd_cleanup_nodes is called by
>>>>>>> kgd2kfd_device_exit under condition if (kfd->init_complete).
>>>>>>> kfd->init_complete is true only after all kfd node got initialized at
>>>>>>> kgd2kfd_device_init. If kfd driver init fail kfd->init_complete would
>>>>>>> be false, then kfd_cleanup_node would not get called.
>>>>>>>
>>>>>> Hi Xiaogang,
>>>>>> It may get triggered on error handling path of ‘kid2kfd_device_init()`,
>>>>>> when it jump to label `node_alloc_error` and
>>>>>> then call `kfd_cleanup_nodes()`.
>>>>>>
>>>>> If it was the case kzalloc failed. That means there is no memory
>>>>> available even to allocate struct kfd_node. From the backtrace the
>>>>> general protection fault happened at
>>>>>
>>>>> RIP: 0010:kgd2kfd_device_exit+0x6/0x60 [amdgpu]
>>>>>
>>>>> It happened during driver module got released, not init time. I do not
>>>>> see how the patch is related to the issue you talked.
>>>> Aha, it’s caused by another bug which got fixed by "[PATCH 4/6] amdgpu:
>>>> fix use after free bug related to amdgpu_driver_release_kms()”.
>>>> Without the fix in "[PATCH 4/6] amdgpu: fix use after free bug related to
>>>> amdgpu_driver_release_kms()”, kgd2kfd_device_exit() will got called
>>>> twice in case of device probe failure. I caused me some time to figure out
>>>> the double free issue.
>>>> Actually we should reset kfd->init_completed to false in function
>>>> kgd2kfd_device_exit().
>>> We can set kfd->init_completed = false at end of kgd2kfd_device_exit, but
>>> how kgd2kfd_device_exit was called two times? is there another bug caused
>>> that?
>>>
>> I guess it caused by another bug related to the way amdgpu cooperates with
>> the amdgpu_xcp driver. It would be better to enhance amdgpu_xcp driver
>> either.
> kfd driver has considered which kfd nodes got initialized and release them
> accordingly. From what saw here seems you may mix different issues or not
> target the real issue. Let's have backtrace match the changes.
>
Sure, I will rework this patch with log message to address the possible
resource leakage.
> Regards
>
> Xiaogang
>
>>
>>> Regards
>>>
>>> Xiaogang
>>>
>>>
>>>>
>>>>> Regards
>>>>> Xiaogang
>>>>>
>>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Gerry
>>>>>>
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Xiaogang
>>>>>>>
>>>>>>>> device_queue_manager_uninit(knode->dqm);
>>>>>>>> kfd_interrupt_exit(knode);
>>>>>>>> kfd_topology_remove_device(knode);
>>>>>>>> @@ -954,7 +956,10 @@ void kgd2kfd_device_exit(struct kfd_dev *kfd)
>>>>>>>> kfd_doorbell_fini(kfd);
>>>>>>>> ida_destroy(&kfd->doorbell_ida);
>>>>>>>> kfd_gtt_sa_fini(kfd);
>>>>>>>> - amdgpu_amdkfd_free_gtt_mem(kfd->adev, &kfd->gtt_mem);
>>>>>>>> + if (kfd->gtt_mem) {
>>>>>>>> + amdgpu_amdkfd_free_gtt_mem(kfd->adev,
>>>>>>>> &kfd->gtt_mem);
>>>>>>>> + kfd->gtt_mem = NULL;
>>>>>>>> + }
>>>>>>>> }
>>>>>>>>
>>>>>>>> kfree(kfd);
>>>>>>
>>>>
>>