Am 2021-08-12 um 11:49 a.m. schrieb Zhang, Yifan:
> [AMD Official Use Only]
>
> Yes, it is a sync issue b/w SVMRange unmap deferred list and get-attribute
> call. There is time slot when Process vma has been unmapped from CPU side,
> but prange is not removed in deferred list. If get-attribute is called in
> this time slot, then there will be a problem.
>
> This issue happens when KFDSVMRangeTest,SetGetAttributesTest runs after
> KFDSVMRangeTest.BasicSystemMemTest test.
>
> TEST_F(KFDSVMRangeTest, SetGetAttributesTest) {
> TEST_REQUIRE_ENV_CAPABILITIES(ENVCAPS_64BITLINUX);
> TEST_START(TESTPROFILE_RUNALL)
> ......
> HSAint32 enable = -1;
> EXPECT_SUCCESS(hsaKmtGetXNACKMode(&enable));
> expectedDefaultResults[4] =
> (enable)?HSA_SVM_ATTR_ACCESS:HSA_SVM_ATTR_NO_ACCESS;
> sysBuffer = new HsaSVMRange(BufSize); // It returns the same addr as the
> last test case since the addr is already munmap for CPU.
> char *pBuf = sysBuffer->As<char *>();
>
> LOG() << "Get default atrributes" << std::endl;
> memcpy(outputAttributes, inputAttributes, nAttributes *
> sizeof(HSA_SVM_ATTRIBUTE));
> EXPECT_SUCCESS(hsaKmtSVMGetAttr(pBuf, BufSize, // This is the
> problematic get-attribute. It returns the the prange attributes in the last
> test case since prange is not removed.
> nAttributes, outputAttributes));
>
> sysBufer address is an unregistered SVM range. Thus hsaKmtSVMGetAttr should
> return the default attributes. But there are some corner cases, deferred list
> from last test case is not finished when get attribute is called, so svm
> range list remains the contents of the last test. In such cases,
> hsaKmtSVMGetAttr returns the attributes in the last test instead of default
> values. (The two test cases are in the same process, mmap in HsaSVMRange
> constructor return the same addr) and makes test case fail.
OK. Thanks for investigating further. I think your patch is the the
correct solution after all. I'd just add a carefully worded comment:
/* Flush pending deferred work to avoid racing with deferred actions from
* previous memory map changes (e.g. munmap). Concurrent memory map changes
* can still race with get_attr because we don't hold the mmap lock. But that
* would be a race condition in the application anyway, and undefined
* behaviour is acceptable in that case.
*/
With that, the patch is
Reviewed-by: Felix Kuehling <[email protected]>
Regards,
Felix
>
>
> -----Original Message-----
> From: Kuehling, Felix <[email protected]>
> Sent: Tuesday, August 10, 2021 11:57 PM
> To: Zhang, Yifan <[email protected]>; [email protected]
> Subject: Re: [PATCH] drm/amdkfd: fix random
> KFDSVMRangeTest.SetGetAttributesTest test failure
>
> Am 2021-08-10 um 12:43 a.m. schrieb Yifan Zhang:
>> KFDSVMRangeTest.SetGetAttributesTest randomly fails in stress test.
>>
>> Note: Google Test filter = KFDSVMRangeTest.* [==========] Running 18
>> tests from 1 test case.
>> [----------] Global test environment set-up.
>> [----------] 18 tests from KFDSVMRangeTest
>> [ RUN ] KFDSVMRangeTest.BasicSystemMemTest
>> [ OK ] KFDSVMRangeTest.BasicSystemMemTest (30 ms)
>> [ RUN ] KFDSVMRangeTest.SetGetAttributesTest
>> [ ] Get default atrributes
>> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154
>> : Failure Value of: expectedDefaultResults[i]
>> Actual: 4294967295
>> Expected: outputAttributes[i].value
>> Which is: 0
>> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154
>> : Failure Value of: expectedDefaultResults[i]
>> Actual: 4294967295
>> Expected: outputAttributes[i].value
>> Which is: 0
>> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:152
>> : Failure Value of: expectedDefaultResults[i]
>> Actual: 4
>> Expected: outputAttributes[i].type
>> Which is: 2
>> [ ] Setting/Getting atrributes
>> [ FAILED ]
>>
>> the root cause is that svm work queue has not finished when
>> svm_range_get_attr is called, thus some garbage svm interval tree data
>> make svm_range_get_attr get wrong result. Flush work queue before iterate
>> svm interval tree.
>>
>> Signed-off-by: Yifan Zhang <[email protected]>
>> ---
>> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> index f811a3a24cd2..192e9401bed5 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> @@ -3072,6 +3072,9 @@ svm_range_get_attr(struct kfd_process *p, uint64_t
>> start, uint64_t size,
>> pr_debug("svms 0x%p [0x%llx 0x%llx] nattr 0x%x\n", &p->svms, start,
>> start + size - 1, nattr);
>>
>> + /* flush pending deferred work */
>> + flush_work(&p->svms.deferred_list_work);
>> +
> There is still a race condition here. More work can be added to the
> deferred_list_work after the flush call.
>
> Work gets added to the deferred_list asynchronously, for example in MMU
> notifiers. Trying to synchronize with asynchronous events is inherently
> problematic. It appears that the test is making some assumptions about things
> happening asynchronously (page faults or MMU notifiers) and that's probably a
> problem with the test, not with the driver.
>
> Alternatively, there may be a problem with a set-attribute call that leaves
> some operations on the deferred list and results in unexpected get-attribute
> results. If that's the problem, we may need to add a flush-call to the end of
> the set-attributes function.
>
> Can you provide more details about the exact sequence of set-attribute and
> get-attribute calls that is causing the problem?
>
> Regards,
> Felix
>
>
>> mmap_read_lock(mm);
>> r = svm_range_is_valid(p, start, size);
>> mmap_read_unlock(mm);