Hi Christian,
Yes, if the test machine is still there for debugging, we can log in and
check many things, such as 'ps -p 1 -o args' as you suggested.
But sometimes the system is not alive anymore, and we only get some log files
(such as kern.log) from QA or customers.
In that case, the full command-line information in dmesg is quite useful.
Let's take the following message as an example:
[ 1740.047122] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.0.0 timeout, signaled seq=91571, emitted seq=91572
[ 1740.050167] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* The job's process information is as below:
[ 1740.053160] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process SDMA pid 2098 thread SDMA pid 2098, cmd line:SDMA --mode goldimage_compare --offscreen --n-swapchain-images 3 --gpu 0 --frontend test_executor --n-test-threads 4
Suppose a customer/QA reports a comp ring job timeout while running a big
test case, vk-example, but the machine is not alive anymore and we cannot
log in for debugging. We would then need to re-run the whole vk-example to
reproduce the issue, and that wastes a lot of time.
But if the kern.log file shows the specific sub-case that was running when
the job timeout happened, we can try only that one, here 'SDMA --mode
goldimage_compare --offscreen --n-swapchain-images 3 --gpu 0 --frontend
test_executor --n-test-threads 4',
and the issue may be reproduced just a few minutes later.
Thanks,
Trigger.
-----Original Message-----
From: amd-gfx <[email protected]> On Behalf Of Koenig,
Christian
Sent: Monday, December 17, 2018 3:50 PM
To: Huang, Trigger <[email protected]>; [email protected]
Cc: Grodzovsky, Andrey <[email protected]>; Qu, Jim <[email protected]>
Subject: Re: [PATCH] drm/amdgpu: Add more process info in VM for debug
Hi Trigger,
> Does this make sense?
Sorry, I wasn't clear enough. The key point is that we don't want/need the
full command line of the process here, because we can already get it from
the information we have.
E.g. "ps -p 1 -o args" gives you the command line of process 1.
The only case where directly printing this into the logs is useful is when
we run into a total system crash, and in that case the process is only the
trigger, not the root cause.
Regards,
Christian.
On 17.12.18 04:03, Huang, Trigger wrote:
> Hi Christian,
>
> Many thanks for pointing out the mistakes
>
> I have some comments as below, would you help to check again?
>
> First of all, you can't get the process info during VM creation, since
> that can happen in X as well.
> [Trigger]: OK, I will keep the original logic, which sets the VM info in
> CS. I will still call kfree(cmd_line) in amdgpu_vm_fini to avoid a memory
> leak.
>
> Second, when a timeout happens the VM structure might already be released,
> so using job->vm is illegal here. What we could try is to get the VM using
> the PASID.
> [Trigger]: OK, I will do it in the job timeout handler, like the VMC page
> fault handler does.
>
> And last we don't want to keep the full command line around.
> [Trigger]: Well, actually, the detailed command line is just what we want.
> For example, there are thousands of sub-cases in one big test case, and the
> arguments may differ for each sub-case.
> In some corner cases, the test machine hangs after running the big test
> case for several hours or even several days, and it is really painful to
> wait another several hours to reproduce and debug the issue.
> But if we know the last sub-case that was running on the test machine, the
> issue *may* be reproducible by running only that specific sub-case with its
> specific arguments for several rounds, which would save us a lot of time in
> reproducing and debugging.
> Does this make sense? If not, how about adding a parameter, such as
> amdgpu_vm_debug_verbose, to turn the command-line dumping on/off?
>
>
> Thanks,
> Trigger
>
> -----Original Message-----
> From: Christian König <[email protected]>
> Sent: Saturday, December 15, 2018 8:23 PM
> To: Huang, Trigger <[email protected]>;
> [email protected]
> Cc: Grodzovsky, Andrey <[email protected]>; Qu, Jim
> <[email protected]>
> Subject: Re: [PATCH] drm/amdgpu: Add more process info in VM for debug
>
> On 15.12.18 09:56, Trigger Huang wrote:
>> When debugging VMC page fault and ring hang issues, the detailed
>> process information is quite helpful, especially when the issue can
>> only be reproduced after a very long run. With this information,
>> running only the specific sub-testcase may reproduce the issue, which
>> can save a lot of debugging time.
>>
>> With this patch, the process information looks like the following.
>> When VMC page fault issue happened:
>> [ 142.978417] amdgpu 0000:00:08.0: [gfxhub] VMC page fault (src_id:0 ring:171 vmid:2 pasid:32769),
>> [ 142.978542] amdgpu 0000:00:08.0: for process ocltst pid 1354 thread ocltst pid 1354, args:./ocltst -m oclperf.so -t OCLPerfDeviceEnqueueEvent,
>> [ 142.978693] amdgpu 0000:00:08.0: in page starting at address 0x0000000000000000 from 27
>>
>> When ring hang issue happened:
>> [ 1740.047122] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring comp_1.0.0 timeout, signaled seq=91571, emitted seq=91572
>> [ 1740.050167] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* The job's process information is as below:
>> [ 1740.053160] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process SDMA pid 2098 thread SDMA pid 2098, cmd line:SDMA --mode goldimage_compare --offscreen --n-swapchain-images 3 --gpu 0 --frontend test_executor --n-test-threads 4
>>
>> Signed-off-by: Trigger Huang <[email protected]>
> Well NAK, we intentionally didn't do it this way.
>
> First of all, you can't get the process info during VM creation, since
> that can happen in X as well.
>
> Second, when a timeout happens the VM structure might already be released,
> so using job->vm is illegal here. What we could try is to get the VM using
> the PASID.
>
> And last we don't want to keep the full command line around.
>
> The only valid addition I can see here is to print the thread info when the
> timeout happens.
>
> Regards,
> Christian.
>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 3 ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 8 ++++++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 11 +++++++++++
>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 1 +
>> drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c | 11 ++++++-----
>> 5 files changed, 26 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> index 1c49b82..1a2d0c9 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> @@ -235,9 +235,6 @@ static int amdgpu_cs_parser_init(struct amdgpu_cs_parser *p, union drm_amdgpu_cs
>> p->job->uf_addr = uf_offset;
>> kfree(chunk_array);
>>
>> - /* Use this opportunity to fill in task info for the vm */
>> - amdgpu_vm_set_task_info(vm);
>> -
>> return 0;
>>
>> free_all_kdata:
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index e0af44f..c75ecb3 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -43,6 +43,14 @@ static void amdgpu_job_timedout(struct drm_sched_job *s_job)
>> job->base.sched->name, atomic_read(&ring->fence_drv.last_seq),
>> ring->fence_drv.sync_seq);
>>
>> + if (job->vm) {
>> + struct amdgpu_task_info *ti = &(job->vm->task_info);
>> +
>> + DRM_ERROR("The job's process information is as below:\n");
>> + DRM_ERROR("Process %s pid %d thread %s pid %d, cmd line:%s\n",
>> + ti->process_name, ti->tgid, ti->task_name, ti->pid, ti->cmd_line);
>> + }
>> +
>> if (amdgpu_device_should_recover_gpu(ring->adev))
>> amdgpu_device_gpu_recover(ring->adev, job);
>> }
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index e73d152..24f3cbd 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -30,6 +30,7 @@
>> #include <linux/idr.h>
>> #include <drm/drmP.h>
>> #include <drm/amdgpu_drm.h>
>> +#include <linux/string_helpers.h>
>> #include "amdgpu.h"
>> #include "amdgpu_trace.h"
>> #include "amdgpu_amdkfd.h"
>> @@ -3045,6 +3046,8 @@ int amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
>> goto error_free_root;
>>
>> vm->pasid = pasid;
>> +
>> + amdgpu_vm_set_task_info(vm);
>> }
>>
>> vm->fault_hash = init_fault_hash();
>> @@ -3223,6 +3226,9 @@ void amdgpu_vm_fini(struct amdgpu_device *adev, struct amdgpu_vm *vm)
>> spin_lock_irqsave(&adev->vm_manager.pasid_lock, flags);
>> idr_remove(&adev->vm_manager.pasid_idr, vm->pasid);
>> spin_unlock_irqrestore(&adev->vm_manager.pasid_lock, flags);
>> +
>> + kfree(vm->task_info.cmd_line);
>> + vm->task_info.cmd_line = NULL;
>> }
>>
>> kfree(vm->fault_hash);
>> @@ -3391,6 +3397,11 @@ void amdgpu_vm_set_task_info(struct amdgpu_vm *vm)
>> vm->task_info.tgid = current->group_leader->pid;
>> get_task_comm(vm->task_info.process_name,
>> current->group_leader);
>> }
>> +
>> + vm->task_info.cmd_line =
>> + kstrdup_quotable_cmdline(current, GFP_KERNEL);
>> + if (!vm->task_info.cmd_line)
>> + DRM_DEBUG_DRIVER("Failed to get cmdline!\n");
>> }
>> }
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> index e8dcfd5..9fab787 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> @@ -175,6 +175,7 @@ struct amdgpu_vm_pte_funcs {
>> struct amdgpu_task_info {
>> char process_name[TASK_COMM_LEN];
>> char task_name[TASK_COMM_LEN];
>> + char *cmd_line;
>> pid_t pid;
>> pid_t tgid;
>> };
>> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> index bacdaef..c3e3558 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
>> @@ -325,11 +325,12 @@ static int gmc_v9_0_process_interrupt(struct amdgpu_device *adev,
>> amdgpu_vm_get_task_info(adev, entry->pasid, &task_info);
>>
>> dev_err(adev->dev,
>> - "[%s] VMC page fault (src_id:%u ring:%u vmid:%u pasid:%u, for process %s pid %d thread %s pid %d)\n",
>> - entry->vmid_src ? "mmhub" : "gfxhub",
>> - entry->src_id, entry->ring_id, entry->vmid,
>> - entry->pasid, task_info.process_name, task_info.tgid,
>> - task_info.task_name, task_info.pid);
>> + "[%s] VMC page fault (src_id:%u ring:%u vmid:%u pasid:%u), ",
>> + entry->vmid_src ? "mmhub" : "gfxhub", entry->src_id,
>> + entry->ring_id, entry->vmid, entry->pasid);
>> + dev_err(adev->dev, "for process %s pid %d thread %s pid %d, args:%s,",
>> + task_info.process_name, task_info.tgid,
>> + task_info.task_name, task_info.pid, task_info.cmd_line);
>> dev_err(adev->dev, " in page starting at address 0x%016llx from %d\n",
>> addr, entry->client_id);
>> if (!amdgpu_sriov_vf(adev))
_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx